One must learn by doing the thing; 
for though you think you know it 
You have no certainty, until you try. 


—Sophocles, Trachiniae 


PREFACE 


The principal goal of this book is to provide a unified introduction to the theory, imple- 
mentation, and applications of statistical and adaptive signal processing methods. We have 
focused on the key topics of spectral estimation, signal modeling, adaptive filtering, and ar- 
ray processing, whose selection was based on the grounds of theoretical value and practical 
importance. The book has been primarily written with students and instructors in mind. The 
principal objectives are to provide an introduction to basic concepts and methodologies that 
can provide the foundation for further study, research, and application to new problems. 
To achieve these goals, we have focused on topics that we consider fundamental and have 
either multiple or important applications. 


APPROACH AND PREREQUISITES 


The adopted approach is intended to help both students and practicing engineers understand 
the fundamental mathematical principles underlying the operation of a method, appreciate 
its inherent limitations, and provide sufficient details for its practical implementation. The 
academic flavor of this book has been influenced by our teaching whereas its practical 
character has been shaped by our research and development activities in both academia and 
industry. The mathematical treatment throughout this book has been kept at a level that is 
within the grasp of upper-level undergraduate students, graduate students, and practicing 
electrical engineers with a background in digital signal processing, probability theory, and 
linear algebra. 


ORGANIZATION OF THE BOOK 


Chapter 1 introduces the basic concepts and applications of statistical and adaptive signal 
processing and provides an overview of the book. Chapters 2 and 3 review the fundamentals 
of discrete-time signal processing, study random vectors and sequences in the time and 
frequency domains, and introduce some basic concepts of estimation theory. Chapter 4 
provides a treatment of parametric linear signal models (both deterministic and stochastic) 
in the time and frequency domains. Chapter 5 presents the most practical methods for 
the estimation of correlation and spectral densities. Chapter 6 provides a detailed study 
of the theoretical properties of optimum filters, assuming that the relevant signals can be 
modeled as stochastic processes with known statistical properties; and Chapter 7 contains 
algorithms and structures for optimum filtering, signal modeling, and prediction. Chapter 
8 introduces the principle of least-squares estimation and its application to the design of 
practical filters and predictors. Chapters 9, 10, and 11 use the theoretical work in Chapters 
4, 6, and 7 and the practical methods in Chapter 8, to develop, evaluate, and apply practical 
techniques for signal modeling, adaptive filtering, and array processing. Finally, Chapter 12 
introduces some advanced topics: definition and properties of higher-order moments, blind 
deconvolution and equalization, and stochastic fractional and fractal signal models with long 
memory. Appendix A contains a review of the matrix inversion lemma, Appendix B reviews 
optimization in complex space, Appendix C contains a list of the MATLAB functions used 
throughout the book, Appendix D provides a review of useful results from matrix algebra, 
and Appendix E includes a proof for the minimum-phase condition for polynomials. 


THEORY AND PRACTICE 


It is our belief that sound theoretical understanding goes hand-in-hand with practical im- 
plementation and application to real-world problems. Therefore, the book includes a large 
number of computer experiments that illustrate important concepts and help the reader 
to easily implement the various methods. Every chapter includes examples, problems, 
and computer experiments that facilitate the comprehension of the material. To help the 
reader understand the theoretical basis and limitations of the various methods and apply 
them to real-world problems, we provide MATLAB functions for all major algorithms and 
examples illustrating their use. The MATLAB files and additional material about the book can 
be found at http://www.artechhouse.com/default.asp?frame=Static/manolakismatlab.html. 
A Solutions Manual with detailed solutions to all the problems is available to the 
instructors adopting the book for classroom use. 


Dimitris G. Manolakis 
Vinay K. Ingle 
Stephen M. Kogon 


CHAPTER 1 


Introduction 


This book is an introduction to the theory and algorithms used for the analysis and pro- 
cessing of random signals and their applications to real-world problems. The fundamental 
characteristic of random signals is captured in the following statement: Although random 
signals are evolving in time in an unpredictable manner, their average statistical proper- 
ties exhibit considerable regularity. This provides the ground for the description of random 
signals using statistical averages instead of explicit equations. When we deal with random 
signals, the main objectives are the statistical description, modeling, and exploitation of the 
dependence between the values of one or more discrete-time signals and their application 
to theoretical and practical problems. 

Random signals are described mathematically by using the theory of probability, ran- 
dom variables, and stochastic processes. However, in practice we deal with random signals 
by using statistical techniques. Within this framework we can develop, at least in princi- 
ple, theoretically optimum signal processing methods that can inspire the development and 
can serve to evaluate the performance of practical statistical signal processing techniques. 
The area of adaptive signal processing involves the use of optimum and statistical signal 
processing techniques to design signal processing systems that can modify their charac- 
teristics, during normal operation (usually in real time), to achieve a clearly predefined 
application-dependent objective. 

The purpose of this chapter is twofold: to illustrate the nature of random signals with 
some typical examples and to introduce the four major application areas treated in this book: 
spectral estimation, signal modeling, adaptive filtering, and array processing. Throughout 
the book, the emphasis is on the application of techniques to actual problems in which the 
theoretical framework provides a foundation to motivate the selection of a specific method. 


1.1 RANDOM SIGNALS 


A discrete-time signal or time series is a set of observations taken sequentially in time, 
space, or some other independent variable. Examples occur in various areas, including 
engineering, natural sciences, economics, social sciences, and medicine. 

A discrete-time signal x(n) is basically a sequence of real or complex numbers called 
samples. Although the integer index n may represent any physical variable (e.g., time, 
distance), we shall generally refer to it as time. Furthermore, in this book we consider only 
time series with observations occurring at equally spaced intervals of time. 

Discrete-time signals can arise in several ways. Very often, a discrete-time signal is 
obtained by periodically sampling a continuous-time signal, that is, $x(n) = x_c(nT)$, where 
$T = 1/F_s$ (seconds) is the sampling period and $F_s$ (samples per second or hertz) is the 
sampling frequency. At other times, the samples of a discrete-time signal are obtained 
by accumulating some quantity (which does not have an instantaneous value) over equal 
intervals of time, for example, the number of cars per day traveling on a certain road. 
Finally, some signals are inherently discrete-time, for example, daily stock market prices. 
Throughout the book, unless otherwise stated, the terms signal, time series, and sequence 
will be used to refer to a discrete-time signal. 
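
As a minimal sketch of periodic sampling, the following Python fragment evaluates x(n) = x_c(nT) for a hypothetical continuous-time signal. The 5-Hz cosine, the sampling frequency of 100 Hz, and the function names are illustrative assumptions and are not taken from the book's MATLAB files.

```python
import math

def sample(xc, Fs, N):
    """Return the N samples x(n) = xc(n*T), n = 0..N-1, with T = 1/Fs."""
    T = 1.0 / Fs  # sampling period in seconds
    return [xc(n * T) for n in range(N)]

# Hypothetical continuous-time signal: a 5-Hz cosine (illustrative choice).
def xc(t):
    return math.cos(2.0 * math.pi * 5.0 * t)

# 40 samples at Fs = 100 Hz span 0.4 s, that is, two periods of the cosine.
x = sample(xc, Fs=100.0, N=40)
```

With these values each period of the cosine is represented by 20 samples, so x(0) and x(20) coincide up to rounding.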

The key characteristics of a time series are that the observations are ordered in time and 
that adjacent observations are dependent (related). To see graphically the relation between 
the samples of a signal that are $l$ sampling intervals away, we plot the points $\{x(n), x(n+l)\}$ 
for $0 \le n \le N-1-l$, where $N$ is the length of the data record. The resulting graph is 
known as the $l$ lag scatter plot. This is illustrated in Figure 1.1, which shows a speech signal 
and two scatter plots that demonstrate the correlation between successive samples. We note 
that for adjacent samples the data points fall close to a straight line with a positive slope. 
This implies high correlation because every sample is followed by a sample with about the 
same amplitude. In contrast, samples that are 20 sampling intervals apart are much less 
correlated because the points in the scatter plot are randomly spread. 
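
The construction of a lag scatter plot can be sketched in a few lines of Python. Because the speech data of Figure 1.1 are not reproduced here, the fragment uses a synthetic stand-in, a first-order recursion driven by Gaussian noise with an assumed coefficient of 0.95. The function names and the use of the sample correlation coefficient to summarize each scatter plot are likewise assumptions made for illustration.

```python
import random

def lag_pairs(x, l):
    """Points {x(n), x(n+l)} for 0 <= n <= N-1-l."""
    return [(x[n], x[n + l]) for n in range(len(x) - l)]

def corr_coef(pairs):
    """Sample correlation coefficient of the scatter-plot points."""
    m = len(pairs)
    mu_a = sum(a for a, _ in pairs) / m
    mu_b = sum(b for _, b in pairs) / m
    cov = sum((a - mu_a) * (b - mu_b) for a, b in pairs)
    var_a = sum((a - mu_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mu_b) ** 2 for _, b in pairs)
    return cov / (var_a * var_b) ** 0.5

# Synthetic stand-in for a correlated signal: x(n) = 0.95 x(n-1) + w(n).
rng = random.Random(0)
x = [0.0]
for _ in range(4999):
    x.append(0.95 * x[-1] + rng.gauss(0.0, 1.0))

r1 = corr_coef(lag_pairs(x, 1))    # adjacent samples
r20 = corr_coef(lag_pairs(x, 20))  # samples 20 intervals apart
```

For this stand-in signal the lag-1 coefficient should come out close to 0.95 and the lag-20 coefficient considerably smaller, which mirrors the qualitative behavior described for the two scatter plots.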

FIGURE 1.1 
(a) The waveform for the speech signal “signal”; (b) two scatter plots for successive samples and samples 
separated by 20 sampling intervals. 

When successive observations of the series are dependent, we may use past observations 
to predict future values. If the prediction is exact, the series is said to be deterministic. 
However, in most practical situations we cannot predict a time series exactly. Such time 
series are called random or stochastic, and the degree of their predictability is determined 
by the dependence between consecutive observations. The ultimate case of randomness 
occurs when every sample of a random signal is independent of all other samples. Such a 
signal, which is completely unpredictable, is known as white noise and is used as a building 
block to simulate random signals with different types of dependence. To summarize, the 
fundamental characteristic of a random signal is the inability to precisely specify its values. 
In other words, a random signal is not predictable, it never repeats itself, and we cannot find 
a mathematical formula that provides its values as a function of time. As a result, random 
signals can only be mathematically described by using the theory of stochastic processes 
(see Chapter 3). 

This book provides an introduction to the fundamental theory and a broad selection 
of algorithms widely used for the processing of discrete-time random signals. Signal pro- 
cessing techniques, dependent on their main objective, can be classified as follows (see 
Figure 1.2): 


• Signal analysis. The primary goal is to extract useful information that can be used to 
understand the signal generation process or extract features that can be used for signal 
classification purposes. Most of the methods in this area are treated under the disciplines 
of spectral estimation and signal modeling. Typical applications include detection and 
classification of radar and sonar targets, speech and speaker recognition, detection and 
classification of natural and artificial seismic events, event detection and classification in 
biological and financial signals, efficient signal representation for data compression, etc. 

• Signal filtering. The main objective of signal filtering is to improve the quality of a signal 
according to an acceptable criterion of performance. Signal filtering can be subdivided 
into the areas of frequency selective filtering, adaptive filtering, and array processing. 
Typical applications include noise and interference cancelation, echo cancelation, channel 
equalization, seismic deconvolution, active noise control, etc. 


We conclude this section with some examples of signals occurring in practical applications. 
Although the description of these signals is far from complete, we provide sufficient 
information to illustrate their random nature and significance in signal processing applications. 


FIGURE 1.2 
Classification of methods for the analysis and processing of random signals. 
[Diagram labels: Random signals; Analysis; Filtering; Theory of stochastic processes, 
estimation, and optimum filtering (Chapters 2, 3, 6, 7); Spectral estimation (Chapters 5, 9); 
Signal modeling (Chapters 4, 8, 9, 12); Adaptive filtering (Chapters 8, 10, 12); 
Array processing (Chapter 11).] 


Speech signals. Figure 1.3 shows the spectrogram and speech waveform corresponding 
to the utterance “signal.” The spectrogram is a visual representation of the distribution 
of the signal energy as a function of time and frequency. We note that the speech signal has 
significant changes in both amplitude level and spectral content across time. The waveform 
contains segments of voiced (quasi-periodic) sounds, such as “e,” and unvoiced or fricative 
(noiselike) sounds, such as “g.” 

FIGURE 1.3 
Spectrogram and acoustic waveform for the utterance “signal.” The horizontal dark bands show the resonances of the 
vocal tract, which change as a function of time depending on the sound or phoneme being produced. 


Speech production involves three processes: generation of the sound excitation, artic- 
ulation by the vocal tract, and radiation from the lips and/or nostrils. If the excitation is 
a quasi-periodic train of air pressure pulses, produced by the vibration of the vocal cords, 
the result is a voiced sound. Unvoiced sounds are produced by first creating a constriction 
in the vocal tract, usually toward the mouth end. Then we generate turbulence by forc- 
ing air through the constriction at a sufficiently high velocity. The resulting excitation is a 
broadband noiselike waveform. 

The spectrum of the excitation is shaped by the vocal tract tube, which has a frequency 
response that resembles the resonances of organ pipes or wind instruments. The resonant 
frequencies of the vocal tract tube are known as formant frequencies, or simply formants. 
Changing the shape of the vocal tract changes its frequency response and results in the 
generation of different sounds. Since the shape of the vocal tract changes slowly during 
continuous speech, we usually assume that it remains almost constant over intervals on the 
order of 10 ms. More details about speech signal generation and processing can be found 
in Rabiner and Schafer 1978; O’Shaughnessy 1987; and Rabiner and Juang 1993. 


Electrophysiological signals. Electrophysiology was established in the late eighteenth 
century when Galvani demonstrated the presence of electricity in animal tissues. Today, 
electrophysiological signals play a prominent role in every branch of physiology, medicine, and 
biology. Figure 1.4 shows a set of typical signals recorded in a sleep laboratory (Rechtschaffen 
and Kales 1968). The most prominent among them is the electroencephalogram (EEG), 
whose spectral content changes to reflect the state of alertness and the mental activity of 
the subject. The EEG signal exhibits some distinctive waves, known as rhythms, whose 
dominant spectral content occupies certain bands as follows: delta (δ), 0.5 to 4 Hz; theta 
(θ), 4 to 8 Hz; alpha (α), 8 to 13 Hz; beta (β), 13 to 22 Hz; and gamma (γ), 22 to 30 Hz. 
During sleep, if the subject is dreaming, the EEG signal shows rapid low-amplitude fluctu- 
ations similar to those obtained in alert subjects, and this is known as rapid eye movement 
(REM) sleep. Some other interesting features occurring during nondreaming sleep periods 
resemble alphalike activity and are known as sleep spindles. More details can be found in 
Duffy et al. 1989 and Niedermeyer and Lopes Da Silva 1998. 

FIGURE 1.4 

Typical sleep laboratory recordings. The two top signals show eye movements, the next one 
illustrates EMG (electromyogram) or muscle tonus, and the last one illustrates brain waves 
(EEG) during the onset of a REM sleep period (from Rechtschaffen and Kales 1968). 


The beat-to-beat fluctuations in heart rate and other cardiovascular variables, such as ar- 
terial blood pressure and stroke volume, are mediated by the joint activity of the sympathetic 
and parasympathetic systems. Figure 1.5 shows time series for the heart rate and systolic ar- 
terial blood pressure. We note that both heart rate and blood pressure fluctuate in a complex 
manner that depends on the mental or physiological state of the subject. The individual or 
joint analysis of such time series can help to understand the operation of the cardiovascular 
system, predict cardiovascular diseases, and help in the development of drugs and devices 
for cardiac-related problems (Grossman et al. 1996; Malik and Camm 1995; Saul 1990). 


Geophysical signals. Remote sensing systems use a variety of electro-optical sensors 
that span the infrared, visible, and ultraviolet regions of the spectrum and find many civilian 
and defense applications. Figure 1.6 shows two segments of infrared scans obtained by a 
space-based radiometer looking down at earth (Manolakis et al. 1994). The shape of the 
profiles depends on the transmission properties of the atmosphere and the objects in the 
radiometer’s field-of-view (terrain or sky background). The statistical characterization and 
modeling of infrared backgrounds are critical for the design of systems to detect missiles 
against such backgrounds as earth’s limb, auroras, and deep-space star fields (Sabins 1987; 
Colwell 1983). Other geophysical signals of interest are recordings of natural and man-made 
seismic events and seismic signals used in geophysical prospecting (Bolt 1993; Dobrin 1988; 
Sheriff 1994). 

FIGURE 1.5 
Simultaneous recordings of the heart rate and systolic blood pressure signals for a 
subject at rest. 

FIGURE 1.6 
Time series of infrared radiation measurements obtained by a scanning radiometer. 


Radar signals. We conveniently define a radar system to consist of both a transmitter 
and a receiver. When the transmitter and receiver are colocated, the radar system is said to 
be monostatic, whereas if they are spatially separated, the system is bistatic. The radar first 
transmits a waveform, which propagates through space as electromagnetic energy, and then 
measures the energy returned to the radar via reflections. When the returns are due to an 
object of interest, the signal is known as a target, while undesired reflections from the earth’s 
surface are referred to as clutter. In addition, the radar may encounter energy transmitted by 
a hostile opponent attempting to jam the radar and prevent detection of certain targets. Col- 
lectively, clutter and jamming signals are referred to as interference. The challenge facing 
the radar system is how to extract the targets of interest in the presence of sometimes severe 
interference environments. Target detection is accomplished by using adaptive processing 
methods that exploit characteristics of the interference in order to suppress these undesired 
signals. 

A transmitted radar signal propagates through space as electromagnetic energy at 
approximately the speed of light $c = 3 \times 10^8$ m/s. The signal travels until it encounters an 
object that reflects the signal’s energy. A portion of the reflected energy returns to the radar 
receiver along the same path. The round-trip delay of the reflected signal determines the 
distance or range of the object from the radar. The radar has a certain receive aperture, 
either a continuous aperture or one made up of a series of sensors. The relative delay of a 
signal as it propagates across the radar aperture determines its angle of arrival, or bearing. 
The extent of the aperture determines the accuracy to which the radar can determine the 
direction of a target. Typically, the radar transmits a series of pulses at a rate known as the 
pulse repetition frequency. Any target motion produces a phase shift in the returns from 
successive pulses caused by the Doppler effect. This phase shift across the series of pulses 
is known as the Doppler frequency of the target, which in turn determines the target radial 
velocity. The collection of these various parameters (range, angle, and velocity) allows the 
radar to locate and track a target. 
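
The relations between the measured quantities and the target parameters can be illustrated with a short numerical sketch. The text above does not state the formulas, so the fragment assumes the standard monostatic relations: range R = c·tau/2 for a round-trip delay tau, radial velocity v = lambda·f_d/2 for a Doppler frequency f_d at wavelength lambda, and a pulse-to-pulse phase shift of 2·pi·f_d/PRF. All numerical values and function names are hypothetical.

```python
import math

C = 3.0e8  # approximate speed of light in m/s

def range_from_delay(tau):
    """Monostatic range (m) from round-trip delay tau (s): R = c*tau/2."""
    return C * tau / 2.0

def radial_velocity(f_doppler, f_carrier):
    """Radial velocity (m/s) from Doppler frequency: v = lambda*f_d/2."""
    wavelength = C / f_carrier
    return wavelength * f_doppler / 2.0

def pulse_phase_shift(f_doppler, prf):
    """Phase shift (rad) between successive pulses: 2*pi*f_d/PRF."""
    return 2.0 * math.pi * f_doppler / prf

R = range_from_delay(1.0e-3 / 1.5)       # hypothetical delay of about 0.667 ms
v = radial_velocity(2.0e3, 1.0e9)        # hypothetical 2-kHz Doppler at 1 GHz
dphi = pulse_phase_shift(2.0e3, 10.0e3)  # hypothetical 10-kHz PRF
```

With these assumed values the delay corresponds to a range of 100 km, the Doppler frequency to a radial velocity of 300 m/s, and the phase shift between successive pulses to 0.4·pi radians.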

An example of a radar signal as a function of range in kilometers (km) is shown in 
Figure 1.7. The signal is made up of a target, clutter, and thermal noise. All the signals have 
been normalized with respect to the thermal noise floor. Therefore, the normalized noise 
has unit variance (0 dB). The target signal is at a range of 100 km with a signal-to-noise 
ratio (SNR) of 15 dB. The clutter, on the other hand, is present at all ranges and is highly 
nonstationary. Its power levels vary from approximately 40 dB at near ranges down to the 
thermal noise floor (0 dB) at far ranges. Part of the nonstationarity in the clutter is due to 
the range falloff of the clutter as its power is attenuated as a function of range. However, the 
rises and dips present between 100 and 200 km are due to terrain-specific artifacts. Clearly, 
the target is not visible, and the clutter interference must be removed or canceled in order 
to detect the target. The challenge is to cancel such a nonstationary signal in order 
to extract the target signal, which motivates the use of adaptive techniques that can adapt to 
the rapidly changing interference environment. More details about radar and radar signal 
processing can be found in Skolnik 1980; Skolnik 1990; and Nathanson 1991. 

FIGURE 1.7 
Example of a radar return signal, plotted as relative power with 
respect to noise versus range. 


1.2 SPECTRAL ESTIMATION 


The central objective of signal analysis is the development of quantitative techniques to 
study the properties of a signal and the differences and similarities between two or more 
signals from the same or different sources. The major areas of random signal analysis 
are (1) statistical analysis of signal amplitude (i.e., the sample values); (2) analysis and 
modeling of the correlation among the samples of an individual signal; and (3) joint signal 
analysis (i.e., simultaneous analysis of two signals in order to investigate their interaction or 
interrelationships). These techniques are summarized in Figure 1.8. The prominent tool in 
signal analysis is spectral estimation, which is a generic term for a multitude of techniques 
used to estimate the distribution of energy or power of a signal from a set of observations. 
Spectral estimation is a very complicated process that requires a deep understanding of 
the underlying theory and a great deal of practical experience. Spectral analysis finds many 
applications in areas such as medical diagnosis, speech analysis, seismology and geophysics, 
radar and sonar, nondestructive fault detection, testing of physical theories, and evaluating 
the predictability of time series. 

FIGURE 1.8 
Summary of random signal analysis techniques. 


Amplitude distribution. The range of values taken by the samples of a signal and how 
often the signal assumes these values together determine the signal variability. The signal 
variability can be seen by plotting the time series and is quantified by the histogram of the 
signal samples, which shows the percentage of the signal amplitude values within a certain 
range. The numerical description of signal variability, which depends only on the value 
of the signal samples and not on their ordering, involves quantities such as mean value, 
median, variance, and dynamic range. 
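
The quantities named above can be computed with the Python standard library, as in the following minimal sketch. The function names are hypothetical, the dynamic range is taken here as the maximum minus the minimum sample value (one common convention, assumed rather than taken from the text), and the histogram uses equal-width bins and requires a signal that is not constant.

```python
import statistics

def amplitude_summary(x):
    """Order-independent descriptors of the signal amplitude."""
    return {
        "mean": statistics.fmean(x),
        "median": statistics.median(x),
        "variance": statistics.pvariance(x),
        "dynamic_range": max(x) - min(x),  # assumed convention: max minus min
    }

def histogram(x, nbins):
    """Percentage of samples falling in each of nbins equal-width bins."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / nbins  # assumes hi > lo
    counts = [0] * nbins
    for v in x:
        k = min(int((v - lo) / width), nbins - 1)  # maximum goes in last bin
        counts[k] += 1
    return [100.0 * c / len(x) for c in counts]

x = [1.0, 2.0, 2.0, 3.0, 7.0]  # toy data, for illustration only
summary = amplitude_summary(x)
percentages = histogram(x, 3)
```

Because none of these quantities depends on the ordering of the samples, applying the same functions to any permutation of x gives identical results.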


Figure 1.9 shows the one-step increments, that is, the first difference $x_d(n) = x(n) - x(n-1)$, 
or approximate derivative, of the infrared signals shown in Figure 1.6, whereas 
Figure 1.10 shows their histograms. Careful examination of the shape of the histogram curves 
indicates that the second signal jumps quite frequently between consecutive samples with 
large steps. In other words, the probability of large increments is significant, as exemplified 
by the fat tails of the histogram in Figure 1.10(b). The knowledge of the probability of 
extreme values is essential in the design of detection systems for digital communications, 
military surveillance using infrared and radar sensors, and intensive care monitoring. In 
general, the shape of the histogram, or more precisely the probability density, is very 
important in applications such as signal coding and event detection. Although many practical 
signals follow a Gaussian distribution, many other signals of practical interest have 
distributions that are non-Gaussian. For example, speech signals have a probability density that 
can be reasonably approximated by a gamma distribution (Rabiner and Schafer 1978). 

FIGURE 1.9 
One-step-increment time series for the infrared data shown in Figure 1.6. 

FIGURE 1.10 
Histograms for the infrared increment signals. 

The significance of the Gaussian distribution in signal processing stems from the fol- 
lowing facts. First, many physical signals can be described by Gaussian processes. Second, 
the central limit theorem (see Chapter 3) states that any process that is the result of the 
combination of many elementary processes will tend, under quite general conditions, to be 
Gaussian. Finally, linear systems preserve the Gaussianity of their input signals. To 
understand the last two statements, consider $N$ independent random quantities $x_1, x_2, \ldots, x_N$ 
with the same probability density $p(x)$ and pose the following question: When does the 
probability distribution $p_N(x)$ of their sum $x = x_1 + x_2 + \cdots + x_N$ have the same shape 
(within a scale factor) as the distribution $p(x)$ of the individual quantities? The standard 
answer is that p(x) should be Gaussian, because the sum of N Gaussian random variables 
is again a Gaussian, but with variance equal to N times that of the individual signals. How- 
ever, if we allow for distributions with infinite variance, additional solutions are possible. 
The resulting probability distributions, known as stable or Levy distributions, have infinite 
variance and are characterized by a thin main lobe and fat tails, resembling the shape of 
the histogram in Figure 1.10(b). Interestingly enough, the Gaussian distribution is a stable 
distribution with finite variance (actually the only one). Because Gaussian and stable non- 
Gaussian distributions are invariant under linear signal processing operations, they are very 
important in signal processing. 
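
The statement that the sum of N independent Gaussian quantities has N times the variance of each one can be checked empirically. The following sketch is a simulation under assumed settings (zero mean, unit variance, 20,000 trials, fixed seed); the function name is hypothetical, and the simulation illustrates the variance scaling only, not the stable non-Gaussian case.

```python
import random
import statistics

def variance_of_sum(N, trials=20000, sigma=1.0, seed=1):
    """Empirical variance of x = x_1 + ... + x_N for i.i.d. zero-mean Gaussians."""
    rng = random.Random(seed)
    sums = [sum(rng.gauss(0.0, sigma) for _ in range(N)) for _ in range(trials)]
    return statistics.pvariance(sums)

v1 = variance_of_sum(1)  # should be close to sigma**2 = 1
v4 = variance_of_sum(4)  # should be close to 4 * sigma**2 = 4
```

The estimates fluctuate from run to run when the seed is changed, but with this many trials they should stay within a few percent of the theoretical values.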


Correlation and spectral analysis. Although scatter plots (see Figure 1.1) illustrate 
nicely the existence of correlation, to obtain quantitative information about the correlation 
structure of a time series x(n) with zero mean value, we use the empirical normalized 
autocorrelation sequence 

$$\hat{\rho}(l) = \frac{\sum_{n=l}^{N-1} x(n)\, x^*(n-l)}{\sum_{n=0}^{N-1} |x(n)|^2} \qquad (1.2.1)$$

which is an estimate of the theoretical normalized autocorrelation sequence. For lag $l = 0$, 
the sequence is perfectly correlated with itself and we get the maximum value of 1. If 
the sequence does not change significantly from sample to sample, the correlation of the 
sequence with its shifted copies, though diminished, is still close to 1. Usually, the correlation 
decreases as the lag increases because distant samples become less and less dependent. Note 
that reordering the samples of a time series changes its autocorrelation but not its histogram. 
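
The empirical normalized autocorrelation (1.2.1) translates directly into code. The sketch below is one possible Python rendering with a hypothetical function name; it returns a complex number so that complex-valued signals are handled, and for real signals the imaginary part is zero. The toy sequences are assumptions chosen to show the effect of reordering.

```python
def normalized_autocorr(x, l):
    """Empirical normalized autocorrelation (1.2.1) at lag l >= 0."""
    N = len(x)
    num = sum(x[n] * complex(x[n - l]).conjugate() for n in range(l, N))
    den = sum(abs(v) ** 2 for v in x)
    return num / den

x = [1.0, 1.0, -1.0, -1.0]        # zero-mean toy sequence
x_perm = [1.0, -1.0, 1.0, -1.0]   # same samples in a different order

rho_x = normalized_autocorr(x, 1).real
rho_perm = normalized_autocorr(x_perm, 1).real
```

Both sequences contain the same sample values and therefore have the same histogram, yet their lag-1 autocorrelations differ (0.25 and -0.75, respectively).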

We say that signals whose empirical autocorrelation decays fast, such as an exponential, 
have short-memory or short-range dependence. If the empirical autocorrelation decays very 
slowly, as a hyperbolic function does, we say that the signal has long-memory or long-range 
dependence. These concepts will be formulated in a theoretical framework in Chapter 3. 
Furthermore, we shall see in the next section that effective modeling of time series with 
short or long memory requires different types of models. 

FIGURE 1.11 
Illustration of the concept of power or energy spectral density function of a random signal. 

The spectral density function shows the distribution of signal power or energy as a 
function of frequency (see Figure 1.11). The autocorrelation and the spectral density of a 
signal form a Fourier transform pair and hence contain the same information. However, 
they present this information in different forms, and one can reveal information that cannot 
be easily extracted from the other. It is fair to say that the spectral density is more widely 
used than the autocorrelation. 
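
A finite-record counterpart of this Fourier transform relation can be verified numerically. The sketch below assumes two particular estimates that the text has not yet introduced: a biased autocorrelation estimate with a 1/N normalization, and a spectrum estimate equal to the squared magnitude of the Fourier transform of the data divided by N. The function names are hypothetical. For any data record and any frequency, the two routes give the same number.

```python
import cmath

def autocorr(x, l):
    """Biased estimate r(l) = (1/N) sum_{n=|l|}^{N-1} x(n) x*(n-|l|), r(-l) = r*(l)."""
    N = len(x)
    m = abs(l)
    r = sum(x[n] * complex(x[n - m]).conjugate() for n in range(m, N)) / N
    return r if l >= 0 else r.conjugate()

def spectrum_from_autocorr(x, omega):
    """Fourier transform of r(l): sum over all lags of r(l) exp(-j*omega*l)."""
    N = len(x)
    s = sum(autocorr(x, l) * cmath.exp(-1j * omega * l) for l in range(-(N - 1), N))
    return s.real

def spectrum_from_data(x, omega):
    """(1/N) |sum_n x(n) exp(-j*omega*n)|^2, computed directly from the data."""
    N = len(x)
    X = sum(x[n] * cmath.exp(-1j * omega * n) for n in range(N))
    return abs(X) ** 2 / N

x = [1.0, -2.0, 0.5, 3.0, -1.0]  # arbitrary toy data record
pairs = [(spectrum_from_autocorr(x, w), spectrum_from_data(x, w)) for w in (0.0, 0.7, 2.5)]
```

Each pair in the list holds two values that agree to within rounding error, which shows that the autocorrelation estimate and the spectrum estimate carry the same information about the record.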

Although the correlation and spectral density functions are the most widely used tools 
for signal analysis, there are applications that require the use of correlations among three or 
more samples and the corresponding spectral densities. These quantities, which are useful 
when we deal with non-Gaussian processes and nonlinear systems, belong to the area of 
higher-order statistics and are described in Chapter 12. 


Joint signal analysis. In many applications, we are interested in the relationship be- 
tween two different random signals. There are two cases of interest. In the first case, the 
two signals are of the same or similar nature, and we want to ascertain and describe the 
similarity or interaction between them. For example, we may want to investigate if there is 
any similarity in the fluctuation of infrared radiation in the two profiles of Figure 1.6. 

In the second case, we may have reason to believe that there is a causal relationship 
between the two signals. For example, one signal may be the input to a system and the 
other signal the output. The task in this case is to find an accurate description of the system, 
that is, a description that allows accurate estimation of future values of the output from the 
input. This process is known as system modeling or identification and has many practical 
applications, including understanding the operation of a system in order to improve the 
design of new systems or to achieve better control of existing systems. 

In this book, we will study joint signal analysis techniques that can be used to understand 
the dynamic behavior between two or more signals. An interesting example involves using 
signals, like the ones in Figure 1.5, to see if there is any coupling between blood pressure 
and heart rate. Some interesting results regarding the effect of respiration and blood pressure 
on heart rate are discussed in Chapter 5. 


1.3 SIGNAL MODELING 


In many theoretical and practical applications, we are interested in generating random sig- 
nals with certain properties or obtaining an efficient representation of real-world random 
signals that captures a desired set of their characteristics (e.g., correlation or spectral fea- 
tures) in the best possible way. We use the term model to refer to a mathematical description 
that provides an efficient representation of the “essential” properties of a signal. 

For example, a finite segment $\{x(n)\}_{n=0}^{N-1}$ of any signal can be approximated by a linear 
combination of constant ($\lambda_k = 1$) or exponentially fading ($0 < \lambda_k < 1$) sinusoids 

$$x(n) \simeq \sum_{k=1}^{M} a_k \lambda_k^n \cos(\omega_k n + \phi_k) \qquad (1.3.1)$$

where $\{a_k, \lambda_k, \omega_k, \phi_k\}_{k=1}^{M}$ are the model parameters. A good model should provide an 
accurate description of the signal with $4M \ll N$ parameters. From a practical viewpoint, we 
are most interested in parametric models, which assume a given functional form completely 
specified by a finite number of parameters. In contrast, nonparametric models do not put 
any restriction on the functional form or the number of model parameters. 

If any of the model parameters in (1.3.1) is random, the result is a random signal. The 
most widely used model is given by 


$$x(n) = \sum_{k=1}^{M} a_k \cos(\omega_k n + \phi_k)$$

where the amplitudes $\{a_k\}$ and the frequencies $\{\omega_k\}$ are constants and the phases $\{\phi_k\}$
are random. This model is known as the harmonic process model and has many theoretical 
and practical applications (see Chapters 3 and 9). 
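The harmonic process can be illustrated with a minimal sketch. The text only says that the phases are random; here we assume, for illustration, that they are independent and uniformly distributed over [0, 2π). The amplitudes, frequencies, ensemble size, and the function name `harmonic_process` are likewise hypothetical choices. Under these assumptions the ensemble mean at any fixed time should be close to zero and the ensemble power close to the sum of $a_k^2/2$.

```python
import math
import random


def harmonic_process(amps, freqs, n_samples, rng):
    """Return one realization of x(n) = sum_k a_k cos(w_k n + phi_k).

    The phases phi_k are drawn independently and uniformly from [0, 2*pi);
    the uniform distribution is an assumption made for this sketch.
    """
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in amps]
    return [
        sum(a * math.cos(w * n + p) for a, w, p in zip(amps, freqs, phases))
        for n in range(n_samples)
    ]


rng = random.Random(0)                       # fixed seed for repeatability
amps = [1.0, 0.5]                            # illustrative amplitudes a_k
freqs = [0.2 * math.pi, 0.45 * math.pi]      # illustrative frequencies (rad/sample)

# Ensemble of realizations: same amplitudes and frequencies, different phases.
ensemble = [harmonic_process(amps, freqs, 64, rng) for _ in range(2000)]

n0 = 10                                      # any fixed time index
ensemble_mean = sum(x[n0] for x in ensemble) / len(ensemble)
ensemble_power = sum(x[n0] ** 2 for x in ensemble) / len(ensemble)
expected_power = sum(a * a / 2.0 for a in amps)   # sum of a_k^2 / 2

print(f"ensemble mean  = {ensemble_mean:+.3f}")
print(f"ensemble power = {ensemble_power:.3f} (theory {expected_power:.3f})")
```

Each realization is a perfectly deterministic sum of sinusoids; the randomness appears only across the ensemble, which is why averages are taken over realizations rather than over time.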

Suppose next that we are given a sequence w(n) of independent and identically dis-
tributed observations. We can create a time series x(n) with dependent observations, by
linearly combining the values of w(n) as

$$x(n) = \sum_{k=-\infty}^{\infty} h(k)\, w(n-k) \tag{1.3.2}$$


which results in the widely used linear random signal model. The model specified by the 
convolution summation (1.3.2) is clearly nonparametric because, in general, it depends on 
an infinite number of parameters. Furthermore, the model is a linear, time-invariant system 
with impulse response h(k) that determines the memory of the model and, therefore, the 
dependence properties of the output x(n). By properly choosing the weights h(k), we can 
generate a time series with almost any type of dependence among its samples. 
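A minimal sketch of how the weights h(k) introduce dependence is given next. It assumes a causal, finite-length h(k), a truncation of the general two-sided sum in (1.3.2), and Gaussian excitation; the weights [1.0, 0.8] and the helper names are illustrative only. The lag-1 correlation coefficient of the IID input should be near zero, whereas that of the output should be near h(0)h(1)/(h(0)² + h(1)²).

```python
import random


def linear_model(h, w):
    """x(n) = sum_k h(k) w(n - k) for a causal, finite-length h.

    Samples of w before n = 0 are taken as zero.
    """
    return [
        sum(h[k] * w[n - k] for k in range(min(len(h), n + 1)))
        for n in range(len(w))
    ]


def corr_coeff(x, lag):
    """Sample autocorrelation coefficient of x at the given lag."""
    mean = sum(x) / len(x)
    c0 = sum((v - mean) ** 2 for v in x)
    c_lag = sum((x[n] - mean) * (x[n + lag] - mean) for n in range(len(x) - lag))
    return c_lag / c0


rng = random.Random(1)
w = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # IID excitation
h = [1.0, 0.8]                                     # illustrative weights h(0), h(1)
x = linear_model(h, w)

rho_w = corr_coeff(w, 1)                           # close to 0 for IID samples
rho_x = corr_coeff(x, 1)                           # nonzero: samples are dependent
rho_theory = h[0] * h[1] / (h[0] ** 2 + h[1] ** 2)

print(f"lag-1 correlation: input {rho_w:+.3f}, output {rho_x:+.3f}, theory {rho_theory:+.3f}")
```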

In practical applications, we are interested in linear parametric models. As we will see, 
parametric models exhibit a dependence imposed by their structure. However, if the number 
of parameters approaches the range of the dependence (in number of samples), the model 
can mimic any form of dependence. The list of desired features for a good model includes 
these: (1) the number of model parameters should be as small as possible (parsimony), 
(2) estimation of the model parameters from the data should be easy, and (3) the model 
parameters should have a physically meaningful interpretation. 

If we can develop a successful parametric model for the behavior of a signal, then we 
can use the model for various applications: 


1. To achieve a better understanding of the physical mechanism generating the signal (e.g., 
earth structure in the case of seismograms). 

2. To track changes in the source of the signal and help identify their cause (e.g., EEG). 

3. To synthesize artificial signals similar to the natural ones (e.g., speech, infrared back- 
grounds, natural scenes, data network traffic). 

4. To extract parameters for pattern recognition applications (e.g., speech and character 
recognition). 

5. To get an efficient representation of signals for data compression (e.g., speech, audio, 
and video coding). 

6. To forecast future signal behavior (e.g., stock market indexes) (Pindyck and Rubinfeld 
1998). 


In practice, signal modeling involves the following steps: (1) selection of an appropriate 
model, (2) selection of the “right” number of parameters, (3) fitting of the model to the 
actual data, and (4) model testing to see if the model satisfies the user requirements for the 
particular application. As we shall see in Chapter 9, this process is very complicated and 
depends heavily on the understanding of the theoretical model properties (see Chapter 4), 
the amount of familiarity with the particular application, and the experience of the user. 


1.3.1 Rational or Pole-Zero Models 


Suppose that a given sample x(n), at time n, can be approximated by the previous sample
weighted by a coefficient a, that is, $x(n) \approx a x(n-1)$, where a is assumed constant over the
signal segment to be modeled. To make the above relationship exact, we add an excitation
term w(n), resulting in

$$x(n) = a x(n-1) + w(n) \tag{1.3.3}$$

where w(n) is an excitation sequence. Taking the z-transform of both sides (discussed in
Chapter 2), we have

$$X(z) = a z^{-1} X(z) + W(z) \tag{1.3.4}$$

which results in the following system function:

$$H(z) = \frac{X(z)}{W(z)} = \frac{1}{1 - a z^{-1}} \tag{1.3.5}$$

By using the identity

$$H(z) = \frac{1}{1 - a z^{-1}} = 1 + a z^{-1} + a^{2} z^{-2} + \cdots \qquad -1 < a < 1 \tag{1.3.6}$$

the single-parameter model in (1.3.3) can be expressed in the following nonparametric form

$$x(n) = w(n) + a w(n-1) + a^{2} w(n-2) + \cdots \tag{1.3.7}$$

which clearly indicates that the model generates a time series with exponentially decaying
dependence.
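The equivalence between the recursion (1.3.3) and the convolution form (1.3.7) can be checked numerically. The following sketch assumes zero initial conditions, so the infinite sum in (1.3.7) reduces to the terms for which w(n − k) is available; the value a = 0.8 and the function names are illustrative.

```python
import random


def ar1_recursive(a, w):
    """x(n) = a x(n-1) + w(n), started from x(-1) = 0."""
    x, prev = [], 0.0
    for wn in w:
        prev = a * prev + wn
        x.append(prev)
    return x


def ar1_convolution(a, w):
    """x(n) = sum_{k=0}^{n} a^k w(n-k): (1.3.7) limited to the available samples."""
    return [sum(a ** k * w[n - k] for k in range(n + 1)) for n in range(len(w))]


rng = random.Random(2)
a = 0.8
w = [rng.gauss(0.0, 1.0) for _ in range(200)]

x_rec = ar1_recursive(a, w)
x_conv = ar1_convolution(a, w)
max_diff = max(abs(p - q) for p, q in zip(x_rec, x_conv))

# Exciting the model with a unit impulse returns the weights a^n.
impulse = [1.0] + [0.0] * 9
h = ar1_recursive(a, impulse)

print(f"largest difference between the two forms: {max_diff:.2e}")
print("impulse response:", [round(v, 4) for v in h])
```

The recursion needs one multiplication per output sample, whereas the convolution form needs a growing number of terms; this is the practical appeal of the parametric description.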
A more general model can be obtained by including a linear combination of the P
previous values of the signal and of the Q previous values of the excitation in (1.3.3), that
is,

$$x(n) = \sum_{k=1}^{P} (-a_k)\, x(n-k) + \sum_{k=0}^{Q} d_k\, w(n-k) \tag{1.3.8}$$

The resulting system function

$$H(z) = \frac{X(z)}{W(z)} = \frac{\sum_{k=0}^{Q} d_k z^{-k}}{1 + \sum_{k=1}^{P} a_k z^{-k}} \tag{1.3.9}$$

is rational, that is, a ratio of two polynomials in the variable $z^{-1}$, hence the term rational
models. We will show in Chapter 4 that any rational model has a dependence structure or 
memory that decays exponentially with time. Because the roots of the numerator polynomial 
are known as zeros and the roots of the denominator polynomial as poles, these models are 
also known as pole-zero models. In the time-series analysis literature, these models are 
known as autoregressive moving-average (ARMA) models. 
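A direct implementation of the difference equation (1.3.8) can be sketched as follows. Zero initial conditions are assumed, and the function name `pole_zero_filter` and the coefficient values are hypothetical. Note the sign convention: with P = 1 and $a_1 = -0.8$ the model reduces to (1.3.3) with a = 0.8.

```python
def pole_zero_filter(a, d, w):
    """Difference equation (1.3.8) with zero initial conditions:

        x(n) = -sum_{k=1}^{P} a[k-1] x(n-k) + sum_{k=0}^{Q} d[k] w(n-k)

    `a` holds a_1..a_P and `d` holds d_0..d_Q.
    """
    x = []
    for n in range(len(w)):
        acc = sum(d[k] * w[n - k] for k in range(len(d)) if n - k >= 0)
        acc -= sum(a[k - 1] * x[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        x.append(acc)
    return x


impulse = [1.0] + [0.0] * 7

# P = 1, Q = 0 with a_1 = -0.8 reproduces (1.3.3) with a = 0.8.
h_ar = pole_zero_filter([-0.8], [1.0], impulse)

# P = 1, Q = 1: one pole at z = 0.5 and one zero at z = -0.4.
h_arma = pole_zero_filter([-0.5], [1.0, 0.4], impulse)

print("all-pole impulse response: ", [round(v, 4) for v in h_ar])
print("pole-zero impulse response:", [round(v, 4) for v in h_arma])
```

In both cases the impulse response eventually decays geometrically at a rate set by the pole, which is consistent with the exponentially decaying memory of rational models.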


Modeling the vocal tract. An example of the application of the pole-zero model is for 
the characterization of the speech production system. Most generally, speech sounds are 
classified as either voiced or unvoiced. For both of these types of speech, the production is 
modeled by exciting a linear system, the vocal tract, with an excitation having a flat, that 
is, constant, spectrum. The vocal tract, in turn, is modeled by using a pole-zero system, 
with the poles modeling the vocal tract resonances and the zeros serving the purpose of 
dampening the spectral response between pole frequencies. In the case of voiced speech, 
the input to the vocal tract model is a quasi-periodic pulse waveform, whereas for unvoiced 
speech the source is modeled as random noise. The system model of the speech production 
process is shown in Figure 1.12. The parameters of this model are the voiced/unvoiced
classification, the pitch period for voiced sounds, the gain parameter, and the coefficients
$\{d_k\}$ and $\{a_k\}$ of the vocal tract filter (1.3.9). This model is widely used for low-bit-rate (less
than 2.4 kbits/s) speech coding, synthetic speech generation, and extraction of features
for speech and speaker recognition (Rabiner and Schafer 1978; Rabiner and Juang 1993;
Furui 1989).


FIGURE 1.12
Speech synthesis system based on pole-zero modeling. [Block diagram: an impulse train
generator (controlled by the pitch period) and a random noise generator feed a
voiced/unvoiced switch; the selected excitation, scaled by the gain, drives a pole-zero
digital filter specified by the vocal tract parameters $\{a_k, d_k\}$.]


1.3.2 Fractional Pole-Zero Models and Fractal Models 


Although the dependence in (1.3.7) becomes stronger as the pole $a \to 1$, it cannot effectively
model time series whose autocorrelation decays asymptotically as a power law. For a = 1,
that is, for a pole on the unit circle (unit pole), we obtain an everlasting constant dependence,
but the output of the model increases without limit and the model is said to be unstable.
However, we can obtain a stable model with long memory by creating a fractional unit
pole, that is, by raising (1.3.6) with a = 1 to a fractional power. Indeed, using the identity

$$H_d(z) = \frac{1}{(1 - z^{-1})^{d}} = 1 + d z^{-1} + \frac{d(d+1)}{2!} z^{-2} + \cdots \qquad -\frac{1}{2} < d < \frac{1}{2} \tag{1.3.10}$$

we have

$$x(n) = w(n) + d\, w(n-1) + \frac{d(d+1)}{2!}\, w(n-2) + \cdots \tag{1.3.11}$$

The weights $h_d(n)$ in (1.3.11) decay according to $n^{d-1}$ as $n \to \infty$; that is, the dependence
decays asymptotically as a power law or hyperbolically. Even if the model (1.3.11) is
specified by one parameter, its implementation involves an infinite-order convolution sum-
mation. Therefore, its practical realization requires an approximation by a rational model
that can be easily implemented by using a difference equation. If w(n) is a sequence of
independent Gaussian random variables, the process generated by (1.3.11) is known as 
fractionally differenced Gaussian noise. Rational models including one or more fractional 
poles are known in time-series analysis as fractional autoregressive integrated moving- 
average models and are studied in Chapter 12. The short-term dependence of these models 
is exponential, whereas their long-term dependence is hyperbolic. 
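The hyperbolic decay of the weights in (1.3.11) can be examined with a short sketch. The weights follow from the binomial expansion in (1.3.10) through the recursion $h_d(0) = 1$, $h_d(n) = h_d(n-1)(n-1+d)/n$; the values d = 0.3 and a = 0.99 are illustrative. For a power law, doubling n should scale the weight by about $2^{d-1}$, whereas an exponential decay $a^n$ shrinks much faster over the same range.

```python
def fractional_weights(d, n_terms):
    """Coefficients of (1 - z^-1)^(-d): h(0) = 1, h(n) = h(n-1) (n - 1 + d) / n."""
    h = [1.0]
    for n in range(1, n_terms):
        h.append(h[-1] * (n - 1 + d) / n)
    return h


d = 0.3
h = fractional_weights(d, 2001)

# Hyperbolic decay: doubling n scales the weight by about 2^(d-1).
ratio_fractional = h[2000] / h[1000]
ratio_power_law = 2.0 ** (d - 1.0)

# For comparison, the exponentially decaying weights a^n of (1.3.7).
a = 0.99
ratio_exponential = a ** 2000 / a ** 1000

print(f"h(1) = {h[1]:.3f}, h(2) = {h[2]:.4f}")
print(f"h(2000)/h(1000) = {ratio_fractional:.4f}, 2^(d-1) = {ratio_power_law:.4f}")
print(f"a^2000/a^1000   = {ratio_exponential:.2e}")
```

Even with a pole as close to the unit circle as a = 0.99, the exponential weights become negligible long before the fractional weights do, which is the sense in which the fractional model has long memory.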

In continuous time, we can create long dependence by using a fractional pole. This is 
illustrated by the following Laplace transform pair 


$$\mathcal{L}\{t^{\beta-1}\} \propto \frac{1}{s^{\beta}} \qquad \beta > 0 \tag{1.3.12}$$

which corresponds to an integrator for $\beta = 1$ and a fractional integrator for $0 < \beta < 1$.
Clearly, the memory of a continuous-time system with impulse response $h_\beta(t) = t^{\beta-1}$ for
$t \ge 0$ and $h_\beta(t) = 0$ for $t < 0$ decays hyperbolically. The response of such a system to
white Gaussian noise results in a nonstationary process called fractional Brownian motion. 
Sampling the fractional Brownian motion process at equal intervals and computing the one- 
step increments result in a stationary discrete-time process known as fractional Gaussian 
noise. Both processes exhibit long memory and are of great theoretical and practical interest 
and their properties and applications are discussed in Chapter 12. 

Exciting a rational model with fractional Gaussian noise leads to a very flexible class 
of models that exhibit exponential short-range dependence and hyperbolic long-range de- 
pendence. The excitation of fractional models (either discrete-time or continuous-time) 
with statistically independent inputs whose amplitude changes are distributed according 
to a stable probability law leads to random signal models with long dependence and high 
amplitude variability. Such models have many practical applications and are also discussed 
in Chapter 12. 

If we can reproduce an object by magnifying some portion of it, we say that the object 
is scale-invariant or self-similar. Thus, self-similarity is invariance with respect to scaling. 
Self-similar geometric objects are known as fractals. More specifically, a signal x(t) is self-
similar if $x(ct) = c^{H} x(t)$ for some c > 0. The constant H is known as the self-similarity
index. It can easily be seen that a signal described by a power law, say, $x(t) = a t^{H}$, is self-
similar. However, such signals are of limited interest. A more interesting and useful type
of signal is one that exhibits a weaker statistical version of self-similarity. A random signal 
is called (statistically) self-similar if its statistical properties are scale-invariant, that is, its 
statistics do not change under magnification or minification. Self-similar random signals are 
also known as random fractals. Figure 1.13 provides a visual illustration of the self-similar
behavior of the variable bit rate video traffic time series. The analysis and modeling of such
time series find extensive use in Internet traffic applications (Michiel and Laevens
1997; Garrett and Willinger 1994).


FIGURE 1.13
Pictorial illustration of self-similarity for the variable bit rate video traffic time series. The
bottom series is obtained from the top series by expanding the segment between the two
vertical lines. Although the two series have lengths of 600 and 60 s, they are remarkably
similar visually and statistically (Courtesy of M. Garrett and M. Vetterli).

A classification of the various signal models described previously is given in Figure
1.14, which also provides information about the chapters of the book where these signals
are discussed.


FIGURE 1.14
Classification of random signal models.


1.4 ADAPTIVE FILTERING 


Conventional frequency-selective digital filters with fixed coefficients are designed to have 
a given frequency response chosen to alter the spectrum of the input signal in a desired 
manner. Their key features are as follows: 


1. The filters are linear and time-invariant. 

2. The design procedure uses the desired passband, transition bands, passband ripple, and 
stopband attenuation. We do not need to know the sample values of the signals to be 
processed. 

3. Since the filters are frequency-selective, they work best when the various components 
of the input signal occupy nonoverlapping frequency bands. For example, it is easy to 
separate a signal and additive noise when their spectra do not overlap. 

4. The filter coefficients are chosen during the design phase and are held constant during 
the normal operation of the filter. 


However, there are many practical application problems that cannot be successfully 
solved by using fixed digital filters because either we do not have sufficient information to 
design a digital filter with fixed coefficients or the design criteria change during the normal 
operation of the filter. Most of these applications can be successfully solved by using special 
“smart” filters known collectively as adaptive filters. The distinguishing feature of adaptive 
filters is that they can modify their response to improve performance during operation 
without any intervention from the user. 


1.4.1 Applications of Adaptive Filters 


The best way to introduce the concept of adaptive filtering is by describing some typical 
application problems that can be effectively solved by using an adaptive filter. The ap- 
plications of adaptive filters can be sorted for convenience into four classes: (1) system 
identification, (2) system inversion, (3) signal prediction, and (4) multisensor interference 
cancelation (see Figure 1.15 and Table 1.1). We next describe each class of applications and 
provide a typical example for each case. 


TABLE 1.1
Classification of adaptive filtering applications.


Application class                       Examples

System identification                   Echo cancelation
                                        Adaptive control
                                        Channel modeling

System inversion                        Adaptive equalization
                                        Blind deconvolution

Signal prediction                       Adaptive predictive coding
                                        Change detection
                                        Radio frequency interference cancelation

Multisensor interference cancelation    Acoustic noise control
                                        Adaptive beamforming


System identification


This class of applications, known also as system modeling, is illustrated in Figure 
1.15(a). The system to be modeled can be either real, as in control system applications, 
or some hypothetical signal transmission path (e.g., the echo path). The distinguishing 
characteristic of the system identification application is that the input of the adaptive filter 
is noise-free and the desired response is corrupted by additive noise that is uncorrelated with 
the input signal. Applications in this class include echo cancelation, channel modeling, and 
identification of systems for control applications (Gitlin et al. 1992; Ljung 1987; Åström and
Wittenmark 1990). In control applications, the purpose of the adaptive filter is to estimate 
the parameters or the state of the system and then to use this information to design a 
controller. In signal processing applications, the goal is to obtain a good estimate of the 
desired response according to the adopted criterion of performance. 


Acoustic echo cancelation. Figure 1.16 shows a typical audio teleconferencing system
that helps two groups of people, located at two different places, to communicate effectively. 
However, the performance of this system is degraded by the following effects: (1) The 
reverberations of the room result from the fact that the microphone picks up not only the 
speech coming from the talker but also reflections from the walls and furniture in the room. 
(2) Echoes are created by the acoustic coupling between the microphone and the loudspeaker 
located in the same room. Speech from room B not only is heard by the listener in room A 
but also is picked up by the microphone in room A, and unless it is prevented, will return 
as an echo to the speaker in room B. 

Several methods to deal with acoustic echoes have been developed. However, the most 
effective technique to prevent or control echoes is adaptive echo cancelation. The basic idea 
is very simple: To cancel the echo, we generate a replica or pseudo-echo and then subtract
it from the real echo. To synthesize the echo replica, we pass the signal at the loudspeaker 
through a device designed to duplicate the reverberation and echo properties of the room 
(echo path), as is illustrated in Figure 1.17.


FIGURE 1.15
The four basic classes of adaptive filtering applications: (a) system identification, (b)
system inversion, (c) signal prediction, and (d) multisensor interference cancelation.


FIGURE 1.16
Typical teleconferencing system without echo control.


FIGURE 1.17
Principle of acoustic echo cancelation using an adaptive echo canceler.


In practice, there are two obstacles to this approach. (1) The echo path is usually
unknown before actual transmission begins and is quite complex to model. (2) The echo
path changes with time, since even the movement of a talker alters the acoustic properties
of the room. Therefore, we cannot design and use a fixed echo canceler with satisfactory
performance for all possible connections. There are two possible ways around this problem:


1. Design a compromise fixed echo canceler based on some “average” echo path, assuming 
that we have sufficient information about the connections to be seen by the canceler. 

2. Design an adaptive echo canceler that can “learn” the echo path when it is first turned on
and afterward “track” its variations without any intervention from the designer. Since
an adaptive canceler matches the echo path for any given connection, it performs better
than a fixed compromise canceler.


We stress that the main task of the canceler is to estimate the echo signal with sufficient 
accuracy; the estimation of the echo path is simply the means for achieving this goal. The 
performance of the canceler is measured by the attenuation of the echo. The adaptive echo
canceler achieves this goal by modifying its response, using the residual echo signal in
an as-yet-unspecified way. More details about acoustic echo cancelation can be found in
Gilloire et al. (1996).
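To make the idea concrete, the following sketch uses one possible adaptation rule, the least-mean-square (LMS) update. The choice of LMS is our assumption for illustration; the discussion above deliberately leaves the algorithm unspecified. The room response `echo_path`, the step size `mu`, the noise level, and all names are hypothetical. The canceler forms an echo replica from the loudspeaker signal, subtracts it from the microphone signal, and uses the residual to adjust its coefficients.

```python
import random


def lms_echo_canceler(far_end, mic, n_taps, mu):
    """Adapt an FIR echo-path estimate with the LMS rule (an assumed choice).

    Returns the residual (echo-canceled) signal and the final tap vector.
    """
    taps = [0.0] * n_taps
    residual = []
    for n in range(len(far_end)):
        # Most recent n_taps loudspeaker samples, newest first (zeros before n = 0).
        u = [far_end[n - k] if n - k >= 0 else 0.0 for k in range(n_taps)]
        echo_replica = sum(t * v for t, v in zip(taps, u))
        e = mic[n] - echo_replica               # residual echo drives the adaptation
        taps = [t + mu * e * v for t, v in zip(taps, u)]
        residual.append(e)
    return residual, taps


rng = random.Random(3)
echo_path = [0.5, -0.3, 0.2, 0.1]               # hypothetical room response
far_end = [rng.gauss(0.0, 1.0) for _ in range(3000)]
echo = [
    sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(far_end))
]
mic = [v + rng.gauss(0.0, 0.01) for v in echo]  # echo plus weak measurement noise

residual, taps = lms_echo_canceler(far_end, mic, n_taps=4, mu=0.05)

tail = slice(-500, None)                        # judge performance after convergence
echo_power = sum(v * v for v in echo[tail]) / 500
residual_power = sum(v * v for v in residual[tail]) / 500

print("estimated echo path:", [round(t, 3) for t in taps])
print(f"echo power {echo_power:.4f} -> residual power {residual_power:.6f}")
```

The quantity of interest is the attenuation of the echo (residual power relative to echo power); the agreement between `taps` and `echo_path` is only the means of achieving it.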


System inversion 


This class of applications, which is illustrated in Figure 1.15(b), is also known as inverse 
system modeling. The goal of the adaptive filter is to estimate and apply the inverse of the 
system. Depending on the application, the input of the adaptive filter may be corrupted by
additive noise, and the desired response may not be available. The existence of the inverse
system and its properties (e.g., causality and stability) creates additional complications.
Typical applications include adaptive equalization (Gitlin et al. 1992), seismic deconvolu- 
tion (Robinson 1984), and adaptive inverse control (Widrow and Walach 1994). 


Channel equalization. To understand the basic principles of the channel equalization 
techniques, we consider a binary data communication system that transmits a band-limited 
analog pulse with amplitudes A (symbol 1) or −A (symbol 0) every $T_b$ s (see Figure 1.18).
Here $T_b$ is known as the symbol interval and $R_b = 1/T_b$ as the baud rate. As the signal
propagates through the channel, it is delayed and attenuated in a frequency-dependent 
manner. Furthermore, it is corrupted by additive noise and other natural or man-made 
interferences. The goal of the receiver is to measure the amplitude of each arriving pulse 
and to determine which one of the two possible pulses has been sent. The received signal is 
sampled once per symbol interval after filtering, automatic gain control, and carrier removal. 
The sampling time is adjusted to coincide with the “center” of the received pulse. The shape 
of the pulse is chosen to attain the maximum rate at which the receiver can still distinguish 
the different pulses. To achieve this goal, we usually choose a band-limited pulse that has 
periodic zero crossings every $T_b$ s.


FIGURE 1.18
Simple model of a digital communications system. [Block diagram: data, transmitter,
channel (with additive noise and interference), receiver, recovered data.]


If the periodic zero crossings of the pulse are preserved after transmission and reception, 
we can measure its amplitude without interference from overlapping adjacent pulses. How- 
ever, channels that deviate from the ideal response (constant magnitude and linear phase) 
destroy the periodic zero-crossing property and the shape of the peak of the pulse. As a 
result, the tails of adjacent pulses interfere with the measurement of the current pulse and 
can lead to an incorrect decision. This type of degradation, which is known as intersymbol 
interference (ISI), is illustrated in Figure 1.19. 
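The effect of losing the periodic zero crossings can be sketched with a symbol-rate model. We assume, for illustration, a sinc pulse (one band-limited pulse with zero crossings at every nonzero multiple of $T_b$), a one-sided pulse tail, and a hypothetical distorted response [1.0, 0.6, 0.5]; none of these specific values come from the text. With the ideal pulse each sample depends only on the current symbol; with the distorted response the tails of the two previous pulses can outweigh the current one and flip the decision.

```python
import math
import random


def sinc(x):
    """Normalized sinc, sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def receive(symbols, response):
    """Symbol-rate samples y(n) = sum_m response[m] * symbols[n - m]."""
    return [
        sum(response[m] * symbols[n - m] for m in range(len(response)) if n - m >= 0)
        for n in range(len(symbols))
    ]


def count_errors(symbols, samples):
    """Decide +A when the sample is positive, -A otherwise, and count mistakes."""
    return sum(1 for s, y in zip(symbols, samples) if (y > 0) != (s > 0))


rng = random.Random(4)
A = 1.0
symbols = [A if rng.random() < 0.5 else -A for _ in range(1000)]

# Ideal case: a sinc pulse sampled every T_b is 1 at its center and ~0 elsewhere.
ideal_response = [sinc(m) for m in range(5)]     # p(m T_b), m = 0..4
# Hypothetical distorted channel: the pulse tails no longer vanish at m T_b.
distorted_response = [1.0, 0.6, 0.5]

errors_ideal = count_errors(symbols, receive(symbols, ideal_response))
errors_isi = count_errors(symbols, receive(symbols, distorted_response))

print(f"decision errors without ISI: {errors_ideal}")
print(f"decision errors with ISI:    {errors_isi}")
```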


FIGURE 1.19
Pulse trains (a) without and (b) with intersymbol interference. [Panel (b): distorted pulses.]


We can compensate for the ISI distortion by using a linear filter called an equalizer. The 
goal of the equalizer is to restore the received pulse, as closely as possible, to its original 
shape. The equalizer transforms the channel to a near-ideal one if its response resembles 
the inverse of the channel. Since the channel is unknown and possibly time-varying, there 
are two ways to approach the problem: (1) Design a fixed compromise equalizer to obtain 
satisfactory performance over a broad range of channels, or (2) design an equalizer that can 
“learn” the inverse of the particular channel and then “track” its variation in real time. 

The characteristics of the equalizer are adjusted by some algorithm that attempts to 
attain the best possible performance. The most appropriate criterion of performance for 
data transmission systems is the probability of symbol error. However, it cannot be used for
two reasons: (1) The “correct” symbol is unknown to the receiver (otherwise there would 
be no reason to communicate), and (2) the number of decisions (observations) needed to 
estimate the low probabilities of error is extremely large. Thus, practical equalizers assess 
their performance by using some function of the difference between the “correct” symbol 
and the output. The operation of practical equalizers involves two modes* of operation,
depending on how we substitute for the unavailable correct symbol sequence. (1) A known
training sequence is transmitted, and the equalizer attempts to improve its performance 
by comparing its output to a synchronized replica of the training sequence stored at the 
receiver. Usually this mode is used when the equalizer starts a transmission session. (2) At 
the end of the training session, when the equalizer starts making reliable decisions, we can 
replace the training sequence with the equalizer’s own decisions. 
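The two modes of operation can be sketched as follows. As an assumption for illustration, the equalizer is an FIR filter adjusted by the LMS rule with the squared difference between reference and output as the performance measure; the channel response, noise level, number of taps, and step size are hypothetical. During the training session the reference is the known symbol; afterward it is the equalizer's own decision.

```python
import random


def lms_equalizer(received, training, n_taps, mu):
    """FIR equalizer adapted by LMS (an assumed choice of algorithm).

    For n < len(training) the known training symbol is the reference
    (training mode); afterward the equalizer's own decision is used
    (decision-directed mode). Returns the list of decisions.
    """
    taps = [0.0] * n_taps
    decisions = []
    for n in range(len(received)):
        u = [received[n - k] if n - k >= 0 else 0.0 for k in range(n_taps)]
        y = sum(t * v for t, v in zip(taps, u))
        decision = 1.0 if y > 0 else -1.0
        reference = training[n] if n < len(training) else decision
        e = reference - y
        taps = [t + mu * e * v for t, v in zip(taps, u)]
        decisions.append(decision)
    return decisions


rng = random.Random(6)
n_train, n_data = 3000, 2000
symbols = [1.0 if rng.random() < 0.5 else -1.0 for _ in range(n_train + n_data)]
channel = [1.0, 0.6, 0.5]                  # hypothetical symbol-rate channel response
received = [
    sum(channel[m] * symbols[n - m] for m in range(len(channel)) if n - m >= 0)
    + rng.gauss(0.0, 0.05)                 # weak additive noise
    for n in range(len(symbols))
]

decisions = lms_equalizer(received, symbols[:n_train], n_taps=11, mu=0.01)

# Count errors only on the data portion, after the training session has ended.
data = range(n_train, n_train + n_data)
errors_unequalized = sum(1 for n in data if (received[n] > 0) != (symbols[n] > 0))
errors_equalized = sum(1 for n in data if decisions[n] != symbols[n])

print(f"errors without equalizer: {errors_unequalized} of {n_data}")
print(f"errors with equalizer:    {errors_equalized} of {n_data}")
```

Decision-directed operation is only sensible once most decisions are correct, which is why the sketch switches modes at the end of the training session rather than at start-up.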

Adaptive equalization is a mature technology that has had the greatest impact on digital 
communications systems, including voiceband, microwave and troposcatter radio, and cable 
TV modems (Qureshi 1985; Lee and Messerschmitt 1994; Gitlin et al. 1992; Bingham 1988; 
Treichler et al. 1996). 


Signal prediction 


In the next class of applications, the goal is to estimate the value $x(n_0)$ of a random
signal by using a set of consecutive signal samples $\{x(n),\ n_1 \le n \le n_2\}$. There are three
cases of interest: (1) forward prediction, when $n_0 > n_2$; (2) backward “prediction,” when
$n_0 < n_1$; and (3) smoothing or interpolation, when $n_1 < n_0 < n_2$. Clearly, in the last case
the value at $n = n_0$ is not used in the computation of the estimate. The most widely used
type is forward linear prediction or simply linear prediction† [see Figure 1.15(c)], where
the estimate is formed by using a linear combination of past samples (Makhoul 1975). 


Linear predictive coding (LPC). The efficient storage and transmission of analog sig- 
nals using digital systems requires the minimization of the number of bits necessary to 
represent the signal while maintaining the quality to an acceptable level according to a cer- 
tain criterion of performance. The conversion of an analog (continuous-time, continuous- 
amplitude) signal to a digital (discrete-time, discrete-amplitude) signal involves two pro- 
cesses: sampling and quantization. Sampling converts a continuous-time signal to a discrete- 
time signal by measuring its amplitude at equidistant intervals of time. Quantization involves 
the representation of the measured continuous amplitude using a finite number of symbols 
and always creates some amount of distortion (quantization noise). 

* Another mode of operation, where the equalizer can operate without the benefit of a training sequence (blind or
self-recovering mode), is discussed in Chapter 12.

† As we shall see in Chapters 4 and 6, linear prediction is closely related, but not identical, to all-pole signal
modeling.

For a fixed number of bits, decreasing the dynamic range of the signal (and therefore the
range of the quantizer) decreases the required quantization step and therefore the average
quantization error power. Therefore, we can decrease the quantization noise by reducing
the dynamic range or equivalently the variance of the signal. If the signal samples are
significantly correlated, the variance of the difference between adjacent samples is smaller
than the variance of the original signal. Thus, we can improve quality by quantizing this 
difference instead of the original signal. This idea is exploited by the linear prediction system 
shown in Figure 1.20. This system uses a linear predictor to form an estimate (prediction) 
$\hat{x}(n)$ of the present sample x(n) as a linear combination of the M past samples, that is,

$$\hat{x}(n) = \sum_{k=1}^{M} a_k\, x(n-k) \tag{1.4.1}$$


The coefficients $\{a_k\}_{k=1}^{M}$ of the linear predictor are determined by exploiting the correlation
between adjacent samples of the input signal with the objective of making the prediction
error

$$e(n) = x(n) - \hat{x}(n) \tag{1.4.2}$$

as small as possible. If the prediction is good, the dynamic range of e(n) should be smaller
than the dynamic range of x(n), resulting in a smaller quantization noise for the same number
of bits or the same quantization noise with a smaller number of bits. The performance of 
the LPC system depends on the accuracy of the predictor. Since the statistical properties 
of the signal x(n) are unknown and change with time, we cannot design an optimum 
fixed predictor. The established practical solution is to use an adaptive linear predictor that 
automatically adjusts its coefficients to compute a “good” prediction at each time instant. 
A detailed discussion of adaptive linear prediction and its application to audio, speech, and 
video signal coding is provided in Jayant and Noll (1984). 
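The variance reduction that motivates predictive coding can be checked with a short sketch. We assume a strongly correlated test signal generated by (1.3.3) with a = 0.9 and a fixed one-tap predictor (M = 1) whose coefficient is fitted by least squares over the whole record; an adaptive predictor would instead update the coefficient as the data arrive. The estimate of bits saved uses the common approximation that each additional bit of a uniform quantizer reduces the quantization noise power by a factor of 4; all names are illustrative.

```python
import math
import random

rng = random.Random(5)
rho = 0.9                                    # illustrative correlation parameter
x, prev = [], 0.0
for _ in range(20000):
    prev = rho * prev + rng.gauss(0.0, 1.0)  # correlated test signal, as in (1.3.3)
    x.append(prev)

# One-tap predictor x_hat(n) = a1 x(n-1), with a1 fitted by least squares.
num = sum(x[n] * x[n - 1] for n in range(1, len(x)))
den = sum(x[n - 1] ** 2 for n in range(1, len(x)))
a1 = num / den


def mean_square(v):
    """Average power of a sequence."""
    return sum(s * s for s in v) / len(v)


signal_power = mean_square(x[1:])
difference_power = mean_square([x[n] - x[n - 1] for n in range(1, len(x))])
error_power = mean_square([x[n] - a1 * x[n - 1] for n in range(1, len(x))])

# Approximate saving in bits per sample at equal quantization noise
# (assumes a uniform quantizer whose range scales with the signal's standard deviation).
bits_saved = 0.5 * math.log2(signal_power / error_power)

print(f"a1 = {a1:.3f}")
print(f"power: signal {signal_power:.3f}, difference {difference_power:.3f}, "
      f"prediction error {error_power:.3f}")
print(f"approximate bits saved per sample: {bits_saved:.2f}")
```

Simple differencing corresponds to fixing the coefficient at 1, so the fitted predictor can never do worse than differencing on the same data.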


FIGURE 1.20
Illustration of the linear prediction of a signal x(n) using a finite number of
past samples. [Diagram: M past samples are combined to form the estimate $\hat{x}(n)$.]


Multisensor interference cancelation


The key feature of this class of applications is the use of multiple sensors to remove 
undesired interference and noise. Typically, a primary signal contains both the signal of 
interest and the interference. Other signals, known as reference signals, are available for 
the purposes of canceling the undesired interference [see Figure 1.15(d)]. These reference 
signals are collected using other sensors in which the signal of interest is not present or is so 
weak that it can be ignored. The amount of correlation between the primary and reference 
signals is measured and used to form an estimate of the interference in the primary signal, 
which is subsequently removed. Had the signal of interest been present in the reference 
signal(s), then this process would have resulted in the removal of the desired signal as 
well. Typical applications in which interference cancelation is employed include array 
processing for radar and communications, biomedical sensing systems, and active noise 
control (Widrow et al. 1975; Kuo and Morgan 1996). 


Active noise control (ANC). The basic idea behind an ANC system is the cancelation 
of acoustic noise using destructive wave interference. To create destructive interference that 
cancels an acoustic noise wave (primary) at a point P, we can use a loudspeaker that creates, 
at the same point P, another wave (secondary) with the same frequency, the same amplitude, 
and 180° phase difference. Therefore, with appropriate control of the peaks and troughs
of the secondary wave, we can produce zones of destructive interference (quietness). ANC
systems using digital signal processing technology find applications in air-conditioning 
ducts, aircraft, cars, and magnetic resonance imaging (MRI) systems (Elliott and Nelson 
1993; Kuo and Morgan 1996). 

Figure 1.21 shows the key components of an adaptive ANC system described in Crawford
et al. (1997). The task of the loudspeaker is to generate an acoustic wave that is a 180°
phase-inverted version of the signal y(t) when it arrives at the error microphone. In this
case the error signal $e(t) = y(t) + \hat{y}(t) = 0$, and we create a “quiet zone” around the
microphone. If the acoustic paths (1) from the noise source to the reference microphone
($G_x$), (2) from the noise source to the error microphone ($G_y$), (3) from the secondary loud-
speaker to the reference microphone ($H_x$), and (4) from the secondary loudspeaker to the
error microphone ($H_{\hat{y}}$) are linear, time-invariant, and known, we can design a linear filter
H such that e(n) = 0. For example, if the effects of $H_x$ and $H_{\hat{y}}$ are negligible, the filter
H should invert $G_x$ to obtain v(t) and then replicate $G_y$ to synthesize $\hat{y}(t) \approx -y(t)$. The
quality of cancelation depends on the accuracy of these two modeling processes.


FIGURE 1.21
Block diagram of the basic components of an active noise control system. [Components:
reference microphone with signal x(t), adaptive active noise controller, secondary
loudspeaker, and error microphone with signal $e(t) = y(t) + \hat{y}(t)$ inside the zone of quiet.]


In practice, the acoustic environment is unknown and time-varying. Therefore, we 
cannot design a fixed ANC filter with satisfactory performance. The only feasible solution 
is to use an adaptive filter with the capacity to identify and track the variation of the various 
acoustic paths and the spectral characteristics of the noise source in real time. The adaptive 
ANC filter adjusts its characteristics by trying to minimize the energy of the error signal 
e(n). Adaptive ANC using digital signal processing technology is an active area of research, 
and despite several successes many problems remain to be solved before such systems find 
their way to more practical applications (Crawford et al. 1997). 


1.4.2 Features of Adaptive Filters 


Careful inspection of the applications discussed in the previous section indicates that every 
adaptive filter consists of the following three modules (see Figure 1.22).


1. Filtering structure. This module forms the output of the filter using measurements of
the input signal or signals. The filtering structure is linear if the output is obtained as 
a linear combination of the input measurements; otherwise, it is said to be nonlinear. 
The structure is fixed by the designer, and its parameters are adjusted by the adaptive 
algorithm. 

2. Criterion of performance (COP). The output of the adaptive filter and the desired 
response (when available) are processed by the COP module to assess its quality with 
respect to the requirements of the particular application. 

3. Adaptive algorithm. The adaptive algorithm uses the value of the criterion of perfor- 
mance, or some function of it, and the measurements of the input and desired response 
(when available) to decide how to modify the parameters of the filter to improve its 
performance. 
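
The three modules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the filtering structure is taken to be a linear FIR filter, the criterion of performance is the squared error, and the adaptive algorithm is an LMS-style gradient step, none of which is prescribed by the text at this point. The function name `lms_filter`, the step size, and the system `h_true` are hypothetical.

```python
import numpy as np

def lms_filter(x, d, num_taps=4, step=0.05):
    """Adapt an FIR filter so that its output tracks the desired response d."""
    w = np.zeros(num_taps)                    # adjustable parameters of the structure
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    for n in range(num_taps - 1, len(x)):
        u = x[n - num_taps + 1:n + 1][::-1]   # most recent inputs, newest first
        y[n] = w @ u                          # 1. filtering structure (linear combination)
        e[n] = d[n] - y[n]                    # 2. error whose square is the criterion
        w = w + step * e[n] * u               # 3. adaptive algorithm (LMS-style update)
    return w, y, e

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)                 # input signal (white noise, an assumption)
h_true = np.array([0.8, -0.4, 0.2, 0.1])      # hypothetical unknown system
d = np.convolve(x, h_true)[:len(x)]           # desired response, noise-free
w, y, e = lms_filter(x, d)
print(np.round(w, 3))
```

In this noise-free setting the adapted parameters approach `h_true`; with measurement noise or a time-varying system they would fluctuate around it instead.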


FIGURE 1.22
Basic elements of a general adaptive filter.

Every adaptive filtering application involves one or more input signals and a desired
response signal that may or may not be accessible to the adaptive filter. We collectively 
refer to these relevant signals as the signal operating environment (SOE) of the adaptive
filter. The design of any adaptive filter requires a great deal of a priori information about the
SOE and a deep understanding of the particular application (Claasen and Mecklenbräuker
1985). This information is needed by the designer to choose the filtering structure and the 
criterion of performance and to design the adaptive algorithm. To be more specific, adaptive 
filters are designed for a specific type of input signal (speech, binary data, etc.), for specific 
types of interferences (additive white noise, sinusoidal signals, echoes of the input signals, 
etc.), and for specific types of signal transmission paths (e.g., linear time-invariant or time- 
varying). After the proper design decisions have been made, the only unknowns, when 
the adaptive filter starts its operation, are a set of parameters that are to be determined by 
the adaptive algorithm using signal measurements. Clearly, unreliable a priori information 
and/or incorrect assumptions about the SOE can lead to serious performance degradations 
or even unsuccessful adaptive filter applications. 

If the characteristics of the relevant signals are constant, the goal of the adaptive filter 
is to find the parameters that give the best performance and then to stop the adjustment. 
However, when the characteristics of the relevant signals change with time, the adaptive 
filter should first find and then continuously readjust its parameters to track these changes. 

A very influential factor in the design of adaptive algorithms is the availability of a 
desired response signal. We have seen that for certain applications, the desired response 
may not be available for use by the adaptive filter. In this book we focus on supervised
adaptive filters that require the use of a desired response signal and we simply call them
adaptive filters (Chapter 10). Unsupervised adaptive filters are discussed in Chapter 12.

Suppose now that the relevant signals can be modeled by stochastic processes with 
known statistical properties. If we adopt the minimum mean square error as a criterion 
of performance, we can design, at least in principle, an optimum filter that provides the 
ultimate solution. From a theoretical point of view, the goal of the adaptive filter is to 
replicate the performance of the optimum filter without the benefit of knowing and using 
the exact statistical properties of the relevant signals. In this sense, the theory of optimum 
filters (see Chapters 6 and 7) is a prerequisite for the understanding, design, performance 
evaluation, and successful application of adaptive filters. 
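
As a brief illustration of what "optimum" means here, the following sketch works out the minimum mean square error solution under assumptions that this section does not state: real-valued signals, a linear FIR estimate $\hat{y}(n) = \mathbf{c}^T\mathbf{x}(n)$, a positive definite correlation matrix $\mathbf{R} = E\{\mathbf{x}(n)\mathbf{x}^T(n)\}$, a cross-correlation vector $\mathbf{d} = E\{\mathbf{x}(n)y(n)\}$, and desired-response power $P_y = E\{y^2(n)\}$. The symbols are illustrative and need not match the notation of later chapters.

```latex
\begin{align*}
P(\mathbf{c}) &= E\{[y(n) - \mathbf{c}^T\mathbf{x}(n)]^2\}
               = P_y - 2\,\mathbf{c}^T\mathbf{d} + \mathbf{c}^T\mathbf{R}\,\mathbf{c} \\
\nabla_{\mathbf{c}} P(\mathbf{c}) &= -2\,\mathbf{d} + 2\,\mathbf{R}\,\mathbf{c} = \mathbf{0}
  \quad\Longrightarrow\quad \mathbf{R}\,\mathbf{c}_o = \mathbf{d} \\
P_o &= P(\mathbf{c}_o) = P_y - \mathbf{d}^T\mathbf{R}^{-1}\mathbf{d}
\end{align*}
```

The optimum filter needs $\mathbf{R}$ and $\mathbf{d}$, which are exactly the statistical properties an adaptive filter must do without; it can only approach $P_o$ by learning from the data.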


1.5 ARRAY PROCESSING 


Array processing deals with techniques for the analysis and processing of signals collected 
by a group of sensors. The collection of sensors makes up the array, and the manner in which 
the signals from the sensors are combined and handled constitutes the processing. The type of 
processing is dictated by the needs of the particular application. Array processing has found 
widespread application in a large number of areas, including radar, sonar, communications, 
seismology, geophysical prospecting for oil and natural gas, diagnostic ultrasound, and 
multichannel audio systems. 


1.5.1 Spatial Filtering or Beamforming 


Generally, an array receives spatially propagating signals and processes them to emphasize 
signals arriving from a certain direction; that is, it acts as a spatially discriminating filter. 
This spatial filtering operation is known as beamforming, because essentially it emulates 
the function of a mechanically steered antenna. An array processor steers a beam to a 
particular direction by computing a properly weighted sum of the individual sensor signals. 
An example of the spatial response of the beamformer, known as the beampattern, is shown 
in Figure 1.23. The beamformer emphasizes signals in the direction to which it is steered 
while attenuating signals from other directions. 


FIGURE 1.23
Example of the spatial response of an array, known as a beampattern, that
emphasizes signals from a direction of interest, known as the look direction.


In the case of an array with sensors equally spaced on a line, known as a uniform
linear array (ULA), there is a direct analogy between beamforming and the frequency-
selective filtering of a discrete-time signal using a finite impulse response (FIR) filter. This
analogy between a beamformer and an FIR filter is illustrated in Figure 1.24. The array of
sensors spatially samples the impinging waves so that in the case of a ULA, the sampling
is performed at equal spatial increments. By contrast, an FIR filter uses a uniformly time-
sampled signal as its input. Consider a plane wave impinging on an array as in Figure 1.25.
The spatial signal arrives at each sensor with a delay determined by the angle of arrival
$\phi$. In the case of a narrowband signal, this delay corresponds to an equal phase shift from
sensor to sensor that results in a spatial frequency across the ULA of

$$u = \frac{d}{\lambda} \sin\phi \tag{1.5.1}$$

where $\lambda$ is the wavelength of the signal and $d$ is the uniform spacing of the sensors. This
spatial frequency is analogous to the temporal frequency encountered in discrete-time sig-
nals. In the beamforming operation, the sensor signals are combined with weights on each
of the sensor signals just as an FIR filter produces an output that is the weighted sum of
time samples. As a frequency-selective FIR filter extracts signals at a frequency of interest,
a beamformer seeks to emphasize signals with a certain spatial frequency (i.e., signals ar-
riving from a particular angle). Thus, it is often beneficial to view a beamformer as a spatial
frequency-selective filter.


FIGURE 1.24
Analogy between beamforming and frequency-selective FIR filtering.

FIGURE 1.25
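
The FIR analogy can be made concrete with a small numerical sketch. It assumes half-wavelength sensor spacing, uniform (nonadaptive) weights, and a particular sign convention for the sensor-to-sensor phase shift; the function names are hypothetical and the code is an illustration rather than the method developed in Chapter 11.

```python
import numpy as np

def spatial_frequency(angle_deg, d_over_lambda=0.5):
    """Spatial frequency u = (d / lambda) * sin(phi) of a plane wave on a ULA."""
    return d_over_lambda * np.sin(np.deg2rad(angle_deg))

def steering_vector(angle_deg, num_sensors, d_over_lambda=0.5):
    """Equal phase shift from sensor to sensor (sign convention is an assumption)."""
    m = np.arange(num_sensors)
    return np.exp(-2j * np.pi * spatial_frequency(angle_deg, d_over_lambda) * m)

def beampattern(weights, angles_deg, d_over_lambda=0.5):
    """Magnitude of the weighted sum of sensor signals for each arrival angle."""
    return np.array([
        abs(np.vdot(weights, steering_vector(a, len(weights), d_over_lambda)))
        for a in angles_deg
    ])

M = 8                                   # number of sensors (illustrative)
look = 20.0                             # look direction in degrees (illustrative)
w = steering_vector(look, M) / M        # uniform weights steered to the look direction
angles = np.linspace(-90, 90, 361)      # 0.5 degree grid
response = beampattern(w, angles)
print(angles[np.argmax(response)], response.max())
```

The weights play the role of FIR coefficients and the angle sweep plays the role of a frequency sweep, so `response` is the spatial counterpart of a magnitude frequency response.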

Many times an array must contend with undesired signals arriving from other directions, 
which may prevent it from successfully extracting the signal of interest for which it was 
designed. In this case, the array must adjust its response to the data it receives to reject signals
from these other directions. The resulting array is an adaptive array as the beamforming
weights are automatically determined by the array during its normal operation without 
the intervention of the designer. Drawing on the frequency-selective FIR filter comparison 
again, we see that an adaptive array is analogous to an adaptive FIR filter that adjusts its 
weights to pass signals at the desired frequency or signals with certain statistical properties 
while rejecting any signals that do not satisfy these requirements. Again, if we can model the
SOE using stationary processes with known statistical properties, we can design an optimum
beamformer that minimizes or maximizes a certain criterion of performance. The optimum 
beamformer can be used to provide guidelines for the design of adaptive beamformers and 
used as a yardstick for their performance evaluation. The analysis, design, and performance 
evaluation of fixed, optimum, and adaptive beamformers are discussed in Chapter 11. 


1.5.2 Adaptive Interference Mitigation in Radar Systems 


The goal of an airborne surveillance radar system is to determine the presence of target 
signals. These targets can be either airborne or found on the ground below. Typical targets 
of interest are other aircraft, ground moving vehicles, or hostile missiles. The desired in- 
formation from these targets is their relative distance from our airborne platform, known as 
the range, their angle with respect to the platform, and their relative speed. The processing 
of the radar consists of the following sequence: 


• Filter out undesired signals through adaptive processing.
• Determine the presence of targets, a process known as detection.
• Estimate the parameters of all detected targets.


To sense these targets, the radar system transmits energy in the direction it is searching 
for targets. The transmitted energy propagates from the airborne radar to the target that 
reflects the radar signal. The reflection then propagates from the target back to the radar. 
Since the radar signal travels at the speed of light ($3 \times 10^8$ m/s), the round-trip delay between
transmission and reception of this signal determines the range of the target. The received 
signal is known as the return. The angle of the target is determined through the use of 
beamforming or spatial filtering using an array of sensor elements. To this end, the radar 
forms a bank of spatial filters evenly spaced in angle and determines which filter contains the 
target. For example, we might be interested in the angular sector −1° < φ < 1°.
Then we might set up a bank of beamformers in this angular region with a spacing of 
0.5°. If these spatial filters perform this operation nonadaptively, it is often referred to as 
conventional beamforming. 
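
Two of the quantities above reduce to simple arithmetic, sketched here. The division by 2 reflects that the signal travels to the target and back; including the sector endpoints in the beam bank is an assumption, as are the function names.

```python
C = 3.0e8  # speed of light in m/s, the rounded value used in the text

def range_from_delay(round_trip_delay_s):
    """Target range in meters; the signal covers the distance twice."""
    return C * round_trip_delay_s / 2.0

def beam_bank(start_deg, stop_deg, spacing_deg):
    """Look directions of evenly spaced beamformers covering [start, stop]."""
    count = int(round((stop_deg - start_deg) / spacing_deg)) + 1
    return [start_deg + k * spacing_deg for k in range(count)]

print(range_from_delay(1e-3))       # range for a 1 ms round-trip delay
print(beam_bank(-1.0, 1.0, 0.5))    # sector and spacing from the example in the text
```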

The detection of target signals is inhibited by the presence of other undesired signals 
known as interference. Two common types of interference are the reflections of the radar 
signal from the ground, known as clutter, and other transmitted energy at the same operating 
frequency as the radar, referred to as jamming. Jamming can be the hostile transmission of 
energy to prevent us from detecting certain signals, or it may be incidental, for example, 
from another radar. Such an interference scenario for an airborne surveillance radar is 
depicted in Figure 1.26. The interference signals are typically much larger than the target 
return. Thus, when a nonadaptive beamformer is used, interference leaks in through the 
sidelobes of the beamformer and prevents us from detecting the target. However, we can 
adjust the beamformer weights such that signals from the directions of the interference are 
rejected while other directions are searched for targets. If the weights are adapted to the 
received data in this way, then the array is known as an adaptive array and the operation 
is called adaptive beamforming. The use of an adaptive beamformer is also illustrated in 
Figure 1.26. We show the spatial response or beampattern of the adaptive array. Note that 
the peak gain of the beamformer is in the direction of the target. On the other hand, the 
clutter and jamming are rejected by placing nulls in the beampattern.

FIGURE 1.26
Example of adaptive beamformer used with an airborne 
surveillance radar for interference mitigation. 


In practice, we do not know the directions of the interferers. Therefore, we need an 
adaptive beamformer that can determine its weights by estimating the statistics of the 
interference environment. If we can model the SOE using stochastic processes with known 
statistical properties, we can design an optimum beamformer that provides the ultimate 
performance. The discussion about adaptive filters in Section 1.4.2 applies to adaptive 
beamformers as well. 

Once we have determined the presence of the target signal, we want to get a better idea 
of the exact angle it was received from. Recall that the beamformers have angles associated 
with them, so the angle of the beamformer in which the target was detected can serve as a 
rough estimate of the angle of the target. The coarseness of our initial estimate is governed 
by the spacing in angle of the filter bank of beamformers, for example, 1°. This resolution 
in angle of the beamformer is often called a beamwidth. To get a better estimate, we can use 
a variety of angle estimation methods. If the angle estimate can refine the accuracy down 
to one-tenth of a beamwidth, for example, 0.1°, then the angle estimator is said to achieve 
10:1 beamsplitting. Achieving an angle accuracy better than the array beamwidth is often 
called superresolution. 


1.5.3 Adaptive Sidelobe Canceler 


Consider the scenario in Figure 1.26 from the adaptive beamforming example for interfer- 
ence mitigation in a radar system. However, instead of an array of sensors, consider a fixed 
(i.e., nonadaptive) channel that has high gain in the direction of the target. This response 
may have been the result of a highly directive dish antenna or a nonadaptive beamformer. 
Sometimes it is necessary to perform beamforming nonadaptively to limit the number of 
channels. One such case arises for very large arrays for which it is impractical to form chan- 
nels by digitally sampling every element. The array is partitioned into subarrays that all 
form nonadaptive beams in the same direction. Then the subarray outputs form the spatial 
channels that are sampled. Each channel is highly directive, though with a lower resolution 
than the entire array. In the case of interference, it is then present in all these subarray
channels and must be removed in some way. To restore its performance to the interference-free
case, the radar system must employ a spatially adaptive method that removes the interference
in the main channel. The sidelobe canceler is one such method and is illustrated in
Figure 1.27.

FIGURE 1.27
Sidelobe canceler with a highly directive main channel and auxiliary 
channels. 


Note that the signal of interest is received from a particular direction in which we assume
the main channel has a large gain. This high-gain channel is known as the main channel, and
it contains both the signal of interest and the jamming interference. The jamming signal is
received from another direction, but because its power is far greater than the attenuation
provided by the antenna sidelobes, it obscures the signal of interest.
The sidelobe canceler uses one or more auxiliary channels in order to cancel
the main-channel interference. These auxiliary channels typically have much lower gain in 
the direction in which the main channel is directed so that they contain only the interference. 
The signal of interest is weak enough that it is below the thermal noise floor in these 
auxiliary channels. Examples of these auxiliary channels would be omnidirectional sensors 
or even directive sensors pointed in the direction of the interference. Note that for very 
strong signals, the signal of interest may be present in the auxiliary channel, in which case 
signal cancelation can occur. Clearly, this application belongs to the class of multisensor 
interference cancelation shown in Figure 1.15. 

The sidelobe canceler uses the auxiliary channels to form an estimate of the interference 
in the main channel. The estimate is computed by weighting the auxiliary channels in an
adaptive manner dependent on the cross-correlation between the auxiliary channels and 
the main channel. The estimate of the main-channel interference is subtracted from the 
main channel. The result is an overall antenna response with a spatial null directed at the 
interference source while maintaining high gain in the direction of interest. Clearly, if we had 
sufficient a priori information, the problem could be solved by designing a fixed canceler. 
However, the lack of a priori information and the changing properties of the environment 
make an adaptive canceler the only viable solution. 
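
A single-auxiliary-channel version of this idea can be sketched numerically. The signal levels, the channel gains, and the use of a one-shot least-squares weight computed from a block of data (instead of a recursively updated weight) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20000
signal = 0.1 * np.sin(2 * np.pi * 0.01 * np.arange(N))   # weak signal of interest
jammer = rng.standard_normal(N)                           # strong interference waveform

# Main channel: high gain on the signal, jammer leaking in through a sidelobe.
main = signal + 2.0 * jammer
# Auxiliary channel: jammer plus thermal noise; the signal is assumed negligible here.
aux = 1.5 * jammer + 0.01 * rng.standard_normal(N)

# Weight from the cross-correlation between auxiliary and main channels,
# normalized by the auxiliary-channel power.
w = np.dot(aux, main) / np.dot(aux, aux)
output = main - w * aux                                   # subtract the interference estimate

before = np.mean((main - signal) ** 2)                    # interference power before canceling
after = np.mean((output - signal) ** 2)                   # residual power after canceling
print(w, before, after)
```

If the signal of interest were strong in the auxiliary channel, the same subtraction would also remove part of the signal, which is the signal cancelation effect mentioned above.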


1.6 ORGANIZATION OF THE BOOK 


In this section we provide an overview of the main topics covered in the book so as to help 
the reader navigate through the material and understand the interdependence among the 
various chapters (see Figure 1.28).

In Chapter 2, we review the fundamental topics in discrete-time signal processing that
can be used for both deterministic and random signals. Chapter 3 provides a concise review 
of the theory of random variables and random sequences and elaborates on certain topics 
that are crucial to developments in subsequent chapters. Reading these chapters is essential 
to familiarize the reader with notation and properties that are repeatedly used throughout the 
rest of the book. Chapter 5 presents the most practical methods for nonparametric estimation
of correlation and spectral densities. The use of these techniques for exploratory investi-
gation of the relevant signal characteristics before performing any modeling or adaptive 
filtering is invaluable. 

Chapters 4 and 6 provide a detailed study of the theoretical properties of signal models 
and optimum filters, assuming that the relevant signals can be modeled by stochastic pro- 
cesses with known statistical properties. In Chapter 7, we develop algorithms and structures 
for optimum filtering and signal modeling and prediction. 

Chapter 8 introduces the general method of least squares and shows how to use it for 
the design of filters and predictors from actual signal observations. The statistical properties 
and the numerical computation of least-squares estimates are also discussed in detail. 

Chapters 9, 10, and 11 use the theoretical work in Chapters 4, 6, and 7 and the prac- 
tical methods in Chapter 8 to develop, evaluate, and apply practical techniques for signal 
modeling, adaptive filtering, and array processing. Finally, Chapter 12 illustrates the use 
of higher-order statistics, presents the basic ideas of blind deconvolution and equalization, 
and concludes with a concise introduction to fractional and random fractal signal models.


In many disciplines, signal processing applications nowadays are almost always imple-
mented using digital hardware operating on digital signals. The foundation of this
modern approach is discrete-time system theory. This book also deals with the statis-
tical analysis and processing of discrete-time signals and the modeling of discrete-time sys-
tems. Therefore, the purpose of this chapter is to focus attention on some important issues
of discrete-time signal processing that are of fundamental importance to signal processing, 
in general, and to this book, in particular. The intent of this chapter is not to teach topics in 
elementary digital signal processing but to review material that will be used throughout this 
book and to establish a consistent notation for it. There are several textbooks on these topics, 
and it is assumed that the reader is familiar with the theory of digital signal processing as 
found in Oppenheim and Schafer (1989) and Proakis and Manolakis (1996).

We begin this chapter with a description and classification of signals in Section 2.1. 
Representation of deterministic signals from the frequency-domain viewpoint is presented 
in Section 2.2. In Section 2.3, discrete-time systems are defined, but the treatment is focused 
on linear, time-invariant (LTI) systems, which are easier to deal with mathematically and 
hence are widely used in practice. Section 2.4 on minimum-phase systems and system 
invertibility is an important section in this chapter that should be reviewed prior to studying 
the rest of the book. The last section, Section 2.5, is devoted to lattice and lattice/ladder 
structures for discrete-time systems (or filters). A brief summary of the topics discussed in 
this chapter is provided in Section 2.6. 


2.1 DISCRETE-TIME SIGNALS 


The physical world is replete with signals, that is, physical quantities that change as a 
function of time, space, or some other independent variable. Although the physical nature 
of signals arising in various applications may be quite different, there are signals that 
have some basic features in common. These attributes make it possible to classify signals 
into families to facilitate their analysis. On the other hand, the mathematical description 
and analysis of signals require mathematical signal models that allow us to choose the 
appropriate mathematical approach for analysis. Signal characteristics and the classification 
of signals based upon either such characteristics or the associated mathematical models are 
the subject of this section. 


CHAPTER 2 
Fundamentals of 
Discrete-Time Signal 
Processing 


2.1.1 Continuous-Time, Discrete-Time, and Digital Signals 


If we assume that to every set of assigned values of independent variables there corresponds 
a unique value of the physical quantity (dependent variable), then every signal can be 
viewed as a function. The dependent variable may be real, in which case we have a real- 
valued signal, or it may be complex, and then we talk about a complex-valued signal. The 
independent variables are always real. 

Any signal whose samples are a single-valued function of one independent variable is 
referred to as a scalar one-dimensional signal. We will refer to it simply as a signal. These 
signals involve one dependent variable and one independent variable and are the signals 
that we mainly deal with in this book. The speech signal shown in Figure 1.1 provides a 
typical example of a scalar signal. 

Let us now look at both the dependent and independent variables of a signal from a 
different perspective. Every signal variable may take on values from either a continuous set 
of values (continuous variable) or a discrete set of values (discrete variable). Signals whose 
dependent and independent variables are continuous are usually referred to as continuous-
time signals, and we will denote these signals by the subscript c, such as $x_c(t)$. In contrast,
signals where both the dependent and the independent variables are discrete are called 
digital signals. If only the independent variables are specified to be discrete, then we have 
a discrete signal. We note that a discrete signal is defined only at discrete values of the 
independent variables, but it may take on any value. Clearly, digital signals are a subset of 
the set of discrete signals. 

In this book, we mainly deal with scalar discrete signals in which the independent 
variable is time. We refer to them as discrete-time signals. Such signals usually arise in 
practice when we sample continuous-time signals, that is, when we select values at discrete- 
time instances. In all practical applications, the values of a discrete-time signal can only 
be described by binary numbers with a finite number of bits. Hence, only a discrete set of 
values is possible; strictly speaking, this means that, in practice, we deal with only digital 
signals. Clearly, digital signals are the only signals amenable to direct digital computation. 
Any other signal has to be first converted to digital form before numerical processing is 
possible. 

Because the discrete nature of the dependent variable complicates the analysis, the usual 
practice is to deal with discrete-time signals and then to consider the effects of the discrete 
amplitude as a separate issue. Obviously, these effects can be reduced to any desirable level 
by accordingly increasing the number of bits (or word length) in the involved numerical 
processing operations. Hence, in the remainder of the book, we limit our attention to discrete- 
time signals. 
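
The effect of word length can be checked with a short sketch. It assumes a uniform rounding quantizer over a symmetric range, which is one common choice and not something the text specifies; the function name `quantize` and the test signal are hypothetical.

```python
import numpy as np

def quantize(x, num_bits, full_scale=1.0):
    """Uniform rounding quantizer with 2**num_bits levels over [-full_scale, full_scale)."""
    step = 2.0 * full_scale / 2 ** num_bits
    q = np.round(x / step) * step
    return np.clip(q, -full_scale, full_scale - step)

n = np.arange(1000)
x = 0.9 * np.cos(2 * np.pi * 0.013 * n)     # discrete-time signal, continuous amplitude

bit_choices = (4, 8, 12, 16)
max_errors = [np.max(np.abs(x - quantize(x, b))) for b in bit_choices]
for b, err in zip(bit_choices, max_errors):
    print(b, err)                            # error shrinks as the word length grows
```

As long as the signal stays inside the quantizer range, the error magnitude is bounded by half a quantization step, so each additional bit halves the bound.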


2.1.2 Mathematical Description of Signals 


The mathematical analysis of a signal requires the availability of a mathematical description 
for the signal itself. The type of description, usually referred to as a signal model, determines 
the most appropriate mathematical approach for the analysis of the signal. We use the term 
signal to refer to either the signal itself or its mathematical description, that is, the signal 
model. The exact meaning will be apparent from the context. Clearly, this distinction is 
necessary if a signal can be described by more than one model. We start with the most 
important classification of signal models as either deterministic or random. 


Deterministic signals 


Any signal that can be described by an explicit mathematical relationship is called
deterministic. In the case of continuous-time signals, this relationship is a given function
of time, for example, $x_c(t) = A \cos(2\pi F_0 t + \theta)$, $-\infty < t < \infty$. For discrete-time signals
that, mathematically speaking, are sequences of numbers, this relationship may be either a
functional expression, for example, $x(n) = a^n$, $-\infty < n < \infty$, or a table of values.

In general, we use the notation x(n) to denote the sequence of numbers that represent 
a discrete-time signal. Furthermore, we use the term nth sample to refer to the value of this 
sequence for a specific value of n. Strictly speaking, the terminology is correct only if the 
discrete-time signal has been obtained by sampling a continuous-time signal $x_c(t)$. In the
case of periodic sampling with sampling period $T$, we have $x(n) = x_c(nT)$, $-\infty < n < \infty$;
that is, $x(n)$ is the $n$th sample of $x_c(t)$. Sometimes, just for convenience, we may plot $x_c(t)$
even if we deal with the signal $x(n)$. Finally, we note that sometimes it is convenient to
form and manipulate complex-valued signals using a pair of real-valued signals as the real 
and imaginary components. 


Basic signals. There are some basic discrete-time signals that we will repeatedly use 
throughout this book: 


• The unit sample or unit impulse sequence $\delta(n)$, defined as

$$\delta(n) = \begin{cases} 1 & n = 0 \\ 0 & n \neq 0 \end{cases} \tag{2.1.1}$$

• The unit step sequence $u(n)$, defined as

$$u(n) = \begin{cases} 1 & n \geq 0 \\ 0 & n < 0 \end{cases} \tag{2.1.2}$$

• The exponential sequence of the form

$$x(n) = a^n \qquad -\infty < n < \infty \tag{2.1.3}$$

If $a$ is a complex number, that is, $a = r e^{j\omega_0}$, $r > 0$, $\omega_0 \neq 0, \pi$, then $x(n)$ is complex-
valued, that is,

$$x(n) = r^n e^{j\omega_0 n} = x_R(n) + j x_I(n) \tag{2.1.4}$$

where

$$x_R(n) = r^n \cos \omega_0 n \quad \text{and} \quad x_I(n) = r^n \sin \omega_0 n \tag{2.1.5}$$

are the real and imaginary parts of $x(n)$, respectively. The complex exponential signal
$x(n)$ and the real sinusoidal signals $x_R(n)$ and $x_I(n)$, which have a decaying (growing)
envelope if $r < 1$ ($r > 1$), are very useful in the analysis of discrete-time signals and
systems.
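
The basic signals above translate directly into array operations. The following is a minimal sketch on a finite index range (the true sequences extend over all n); the function names and the values of r and ω0 are illustrative.

```python
import numpy as np

def unit_sample(n):
    """delta(n): 1 at n = 0 and 0 elsewhere."""
    return np.where(n == 0, 1.0, 0.0)

def unit_step(n):
    """u(n): 1 for n >= 0 and 0 for n < 0."""
    return np.where(n >= 0, 1.0, 0.0)

def complex_exponential(n, r, omega0):
    """x(n) = r**n * exp(j * omega0 * n), with a = r * exp(j * omega0)."""
    n = np.asarray(n, dtype=float)
    return (r ** n) * np.exp(1j * omega0 * n)

n = np.arange(-5, 6)                         # finite window of the time index
r, omega0 = 0.9, np.pi / 4                   # decaying envelope since r < 1
x = complex_exponential(n, r, omega0)
x_real = r ** n.astype(float) * np.cos(omega0 * n)   # real part, r^n cos(omega0 n)
x_imag = r ** n.astype(float) * np.sin(omega0 * n)   # imaginary part, r^n sin(omega0 n)
print(unit_sample(n))
print(unit_step(n))
```

On this window the running sum of `unit_sample(n)` reproduces `unit_step(n)`, which is a quick sanity check on both definitions.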


Signal classification. Deterministic signals can be classified as energy or power, peri- 
odic or aperiodic, of finite or infinite duration, causal or noncausal, and even or odd signals. 
Although we next discuss these concepts for discrete-time signals, a similar discussion 
applies to continuous-time signals as well. 


• The total energy or simply the energy of a signal $x(n)$ is given by

$$E_x = \sum_{n=-\infty}^{\infty} |x(n)|^2 \geq 0 \tag{2.1.6}$$

The energy is zero if and only if $x(n) = 0$ for all $n$. The average power or simply the
power of a signal $x(n)$ is defined as

$$P_x = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{n=-N}^{N} |x(n)|^2 \geq 0 \tag{2.1.7}$$

A signal with finite energy, that is, $0 < E_x < \infty$, is called an energy signal. Signals
with finite power, that is, $0 < P_x < \infty$, are referred to as power signals. Clearly, energy
signals have zero power, and power signals have infinite energy.

• A discrete-time signal $x(n)$ is called periodic with fundamental period $N$ if $x(n + N) =
x(n)$ for all $n$. Otherwise it is called aperiodic. It can be seen that the complex exponential
in (2.1.4) with $r = 1$ is periodic if and only if $\omega_0/(2\pi) = k/N$, that is, if $\omega_0/(2\pi)$ is a rational number.
Clearly, a periodic signal is a power signal with power $P_x$ given by

$$P_x = \frac{1}{N} \sum_{n=0}^{N-1} |x(n)|^2 \tag{2.1.8}$$

• We say that a signal $x(n)$ has finite duration if $x(n) = 0$ for $n < N_1$ and $n > N_2$, where
$N_1$ and $N_2$ are finite integer numbers with $N_1 < N_2$. If $N_1 = -\infty$ and/or $N_2 = \infty$, the
signal $x(n)$ has infinite duration.

• A signal $x(n)$ is said to be causal if $x(n) = 0$ for $n < 0$. Otherwise, it is called noncausal.

• Finally, a real-valued signal $x(n)$ is called even if $x(-n) = x(n)$ and odd if $x(-n) =
-x(n)$.
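
A short numerical check of the energy and power definitions follows. It is a sketch on finite windows: the causal exponential is truncated where its tail is negligible, and the periodic signal is averaged over exactly one period as in (2.1.8). The example signals and function names are illustrative.

```python
import numpy as np

def energy(x):
    """Sum of |x(n)|^2 over the available samples."""
    return float(np.sum(np.abs(x) ** 2))

def average_power(x):
    """Mean of |x(n)|^2 over the available samples."""
    return float(np.mean(np.abs(x) ** 2))

n = np.arange(200)
decaying = 0.5 ** n                          # causal exponential a^n u(n) with a = 0.5
N = 8
periodic = np.cos(2 * np.pi * n / N)         # periodic with fundamental period N = 8

E_decaying = energy(decaying)                # geometric series, close to 1 / (1 - a^2)
P_periodic = average_power(periodic[:N])     # power over one period, as in (2.1.8)
P_decaying = average_power(decaying)         # tends to zero as the window grows
print(E_decaying, P_periodic, P_decaying)
```

The decaying exponential behaves as an energy signal (finite energy, vanishing average power), while the cosine is a power signal whose energy would grow without bound as the window is extended.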


Other classifications for deterministic signals will be introduced in subsequent sections. 


Random signals 


In contrast to the deterministic signals discussed so far, there are many other signals 
in practice that cannot be described to any reasonable accuracy by explicit mathematical 
relationships. The lack of such an explicit relationship implies that the signal evolves in 
time in an unpredictable manner from the point of view of the observer. Such signals are 
called random. The output of a noise generator, the height of waves in a stormy sea, and the 
acoustic pressures generated by air rushing through the human vocal tract are examples of 
random signals. At this point one could say that complete knowledge of the physics of the 
signal could provide an explicit mathematical relationship, at least within the limits of the 
uncertainty principle. However, such relationships are typically too complex to be of any 
practical use. 

In general, although random signals are evolving in time in an unpredictable manner, 
their average properties can often be assumed to be deterministic; that is, they can be 
specified by explicit mathematical formulas. This concept is key to the modeling of a 
random signal as a stochastic process. 

Thus, random signals are mathematically described by stochastic processes and can be 
analyzed by using statistical methods instead of explicit equations. The theory of probability, 
random variables, and stochastic processes provides the mathematical framework for the 
theoretical study of random signals. 


2.1.3 Real-World Signals 


The classification of various physical data as being either deterministic or random might 
be debated in many cases. For example, it might be argued that no physical data in practice 
can be truly deterministic since there is always a possibility that some unforeseen event in 
the future might influence the phenomenon producing the data in a manner that was not 
originally considered. On the other hand, it might be argued that no physical data are truly 
random since exact mathematical descriptions might be possible if sufficient knowledge 
of the basic mechanisms of the phenomenon producing the data were known. In practical 
terms, the decision as to whether physical data are deterministic or random is usually 
based upon the ability to reproduce the data by controlled experiments. If an experiment 
producing specific data of interest can be repeated many times with identical results (within 
the limits of experimental error), then the data can generally be considered deterministic. If 
an experiment cannot be designed that will produce identical results when the experiment 
is repeated, then the data must usually be considered random in nature. 


2.2 TRANSFORM-DOMAIN REPRESENTATION 
OF DETERMINISTIC SIGNALS 


In the deterministic signal model, signals are assumed to be explicitly known for all time
from $-\infty$ to $+\infty$. In this sense, no uncertainty exists regarding their past, present, or
future amplitude values. The simplest description of any signal is an amplitude-versus-time 
plot. This “time history” of the signal is very useful for visual analysis because it helps 
in the identification of specific patterns, which can subsequently be used to extract useful 
information from the signal. However, quite often, information present in a signal becomes 
more evident by transformation of the signal into another domain. In this section, we review 
some transforms for the representation and analysis of discrete-time signals. 


2.2.1 Fourier Transforms and Fourier Series 


Frequency analysis is, roughly speaking, the process of decomposing a signal into fre- 
quency components, that is, complex exponential signals or sinusoidal signals. Although 
the physical meaning of frequency analysis is almost the same for any signal, the appro- 
priate mathematical tools depend upon the type of signal under consideration. The two 
characteristics that specify the frequency analysis tools for deterministic signals are 


• The nature of time: continuous-time or discrete-time signals.
• The existence of harmony: periodic or aperiodic signals.


Thus, we have the following four types of frequency analysis tools. 


Fourier series for continuous-time periodic signals 


If a continuous-time signal $x_c(t)$ is periodic with fundamental period $T_p$, it can be
expressed as a linear combination of harmonically related complex exponentials

$$x_c(t) = \sum_{k=-\infty}^{\infty} \check{X}_c(k)\, e^{j2\pi k F_0 t} \tag{2.2.1}$$


where $F_0 = 1/T_p$ is the fundamental frequency, and

$$\check{X}_c(k) = \frac{1}{T_p}\int_{0}^{T_p} x_c(t)\, e^{-j2\pi k F_0 t}\, dt \tag{2.2.2}$$


which are termed the Fourier coefficients,† or the spectrum of $x_c(t)$.
It can be shown that the power of the signal $x_c(t)$ is given by Parseval’s relation

$$P_x = \frac{1}{T_p}\int_{0}^{T_p} |x_c(t)|^2\, dt = \sum_{k=-\infty}^{\infty} |\check{X}_c(k)|^2 \tag{2.2.3}$$


Since $|\check{X}_c(k)|^2$ represents the power in the $k$th frequency component, the sequence $|\check{X}_c(k)|^2$,
$-\infty < k < \infty$, is called the power spectrum of $x_c(t)$ and shows the distribution of power
within various frequency components. Since the power of $x_c(t)$ is confined to the discrete
frequencies $0, \pm F_0, \pm 2F_0, \ldots$, we say that $x_c(t)$ has a line or discrete spectrum.


Fourier transform for continuous-time aperiodic signals 
The frequency analysis of a continuous-time, aperiodic signal can be done by using the 
Fourier transform 



$$X_c(F) = \int_{-\infty}^{\infty} x_c(t)\, e^{-j2\pi F t}\, dt \tag{2.2.4}$$

which exists if $x_c(t)$ satisfies the Dirichlet conditions, which require that $x_c(t)$: (1) have a
finite number of maxima or minima within any finite interval, (2) have a finite number of
discontinuities within any finite interval, and (3) be absolutely integrable, that is,

$$\int_{-\infty}^{\infty} |x_c(t)|\, dt < \infty \tag{2.2.5}$$

†We use the notation $\check{X}_c(k)$ instead of $X_c(k)$ to distinguish it from the Fourier transform $X_c(F)$ introduced in
(2.2.4).


The signal $x_c(t)$ can be synthesized from its spectrum $X_c(F)$ by using the following inverse
Fourier transform formula

$$x_c(t) = \int_{-\infty}^{\infty} X_c(F)\, e^{j2\pi F t}\, dF \tag{2.2.6}$$

The energy of $x_c(t)$ can be computed in either the time or frequency domain using
Parseval’s relation

$$E_x = \int_{-\infty}^{\infty} |x_c(t)|^2\, dt = \int_{-\infty}^{\infty} |X_c(F)|^2\, dF \tag{2.2.7}$$

The function $|X_c(F)|^2 \ge 0$ shows the distribution of energy of $x_c(t)$ as a function of
frequency. Hence, it is called the energy spectrum of $x_c(t)$. We note that continuous-time,
aperiodic signals have continuous spectra.


Fourier series for discrete-time periodic signals 


Any discrete-time periodic signal $x(n)$ with fundamental period $N$ can be expressed
by the following Fourier series

$$x(n) = \sum_{k=0}^{N-1} X_k\, e^{j(2\pi/N)kn} \tag{2.2.8}$$

where

$$X_k = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\, e^{-j(2\pi/N)kn} \tag{2.2.9}$$

are the corresponding Fourier coefficients. The basis sequences $s_k(n) \triangleq e^{j(2\pi/N)kn}$ are
periodic with fundamental period $N$ in both time and frequency, that is, $s_k(n + N) = s_k(n)$
and $s_{k+N}(n) = s_k(n)$.

The sequence $X_k$, $k = 0, \pm 1, \pm 2, \ldots$, is called the spectrum of the periodic signal
$x(n)$. We note that $X_{k+N} = X_k$; that is, the spectrum of a discrete-time periodic signal is
discrete and periodic with the same period.

The power of the periodic signal x(n) can be determined by Parseval’s relation 


$$P_x = \frac{1}{N}\sum_{n=0}^{N-1} |x(n)|^2 = \sum_{k=0}^{N-1} |X_k|^2 \tag{2.2.10}$$

The sequence $|X_k|^2$ is known as the power spectrum of the periodic sequence $x(n)$.
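
As a numerical illustration of (2.2.8) through (2.2.10), the following minimal sketch computes the Fourier series coefficients of one period, resynthesizes the period, and compares the power in both domains. The helper names `dtfs` and `synthesize` and the sample values are illustrative assumptions, not part of the text.

```python
import cmath

def dtfs(x):
    """Fourier series coefficients X_k of one period x(0..N-1), as in (2.2.9)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N)) / N
            for k in range(N)]

def synthesize(X):
    """Rebuild one period from its coefficients, as in (2.2.8)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N))
            for n in range(N)]

x = [1.0, 2.0, 0.0, -1.0]   # one period, N = 4 (arbitrary illustrative values)
X = dtfs(x)
x_rec = synthesize(X)

# Parseval's relation (2.2.10): average power in time equals the sum of |X_k|^2.
power_time = sum(abs(v) ** 2 for v in x) / len(x)
power_freq = sum(abs(c) ** 2 for c in X)
print(power_time, power_freq)
```

The two printed values should agree to within rounding error, and `x_rec` should reproduce `x`.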


Fourier transform for discrete-time aperiodic signals 


Any discrete-time signal that is absolutely summable, that is,

$$\sum_{n=-\infty}^{\infty} |x(n)| < \infty \tag{2.2.11}$$

can be described by the discrete-time Fourier transform (DTFT)

$$X(e^{j\omega}) \triangleq \mathcal{F}[x(n)] = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n} \tag{2.2.12}$$

where $\omega = 2\pi f$ is the frequency variable in radians per sampling interval or simply in
radians per sample and $f$ is the frequency variable in cycles per sampling interval or simply
in cycles per sample. The signal $x(n)$ can be synthesized from its spectrum $X(e^{j\omega})$ by the
inverse Fourier transform

$$x(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega \tag{2.2.13}$$

We will say that $x(n)$ and $X(e^{j\omega})$ form a Fourier transform pair denoted by

$$x(n) \overset{\mathcal{F}}{\longleftrightarrow} X(e^{j\omega}) \tag{2.2.14}$$

The function $X(e^{j\omega})$ is periodic with fundamental period $2\pi$. If $x(n)$ is real-valued, then
$|X(e^{j\omega})| = |X(e^{-j\omega})|$ (even function) and $\angle X(e^{-j\omega}) = -\angle X(e^{j\omega})$ (odd function).
The energy of the signal can be computed in either the time or frequency domain using
Parseval’s relation

$$E_x = \sum_{n=-\infty}^{\infty} |x(n)|^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(e^{j\omega})|^2\, d\omega \tag{2.2.15}$$

$$\phantom{E_x} = \int_{-\pi}^{\pi} \frac{|X(e^{j\omega})|^2}{2\pi}\, d\omega \tag{2.2.16}$$

The function $|X(e^{j\omega})|^2/(2\pi)$ is nonnegative and describes the distribution of the energy of the signal
at various frequencies. Therefore, it is called the energy spectrum of $x(n)$.


Spectral classification of deterministic signals 


So far we have discussed frequency analysis methods for periodic power signals and 
aperiodic energy signals. However, there are deterministic aperiodic signals with finite 
power. One such class of signals is the complex exponential sequence $A e^{j(\omega_0 n + \theta_0)}$ [or
equivalently, the sinusoidal sequence $A\cos(\omega_0 n + \theta_0)$], in which $\omega_0/(2\pi)$ is not a rational
number. This sequence is not periodic, as discussed in Section 2.1.2; however, it has a line
spectrum at $\omega = \omega_0 + 2\pi k$, for any integer $k$, since

$$x(n) = A e^{j(\omega_0 n + \theta_0)} = A e^{j[(\omega_0 + 2\pi k)n + \theta_0]} \qquad k = 0, \pm 1, \pm 2, \ldots$$

(or at $\omega = \pm\omega_0 + 2\pi k$ for the sinusoidal sequence). Hence such sequences are termed
almost periodic and can be treated in the frequency domain in almost the same fashion.

Another interesting class of aperiodic power signals is those consisting of a linear
combination of complex exponentials with nonharmonically related frequencies $\{\omega_l\}$,
for example,

$$x(n) = \sum_{l=1}^{L} X_l\, e^{j\omega_l n} \tag{2.2.17}$$


Clearly, these signals have discrete (or line) spectra, but the lines are not uniformly dis- 
tributed on the frequency axis. Furthermore, the distances between the various lines are not 
harmonically related. We will say that these signals have discrete nonharmonic spectra. 
Note that periodic signals have discrete harmonic spectra. 

There is yet another class of power signals, for example, the unit-step signal $u(n)$
defined in (2.1.2). The Fourier transform of such signals exists only in the context of the
theory of generalized functions, which allows the use of impulse functions in the frequency
domain (Papoulis 1977); for example, the Fourier transform of the unit step $u(n)$ is given
by

$$\mathcal{F}[u(n)] = \frac{1}{1 - e^{-j\omega}} + \sum_{k=-\infty}^{\infty} \pi\,\delta(\omega - 2\pi k) \tag{2.2.18}$$

Such signals have mixed spectra. The use of impulses also implies that the line spectrum 
can be represented in the frequency domain as a continuous spectrum by an impulse train. 
Figure 2.1 provides a classification of deterministic signals (with finite power or energy) in
the frequency domain.

FIGURE 2.1
Spectral classification of deterministic (finite power or energy) signals. 


2.2.2 Sampling of Continuous-Time Signals 


In most practical applications, discrete-time signals are obtained by sampling continuous- 
time signals periodically in time. If $x_c(t)$ is a continuous-time signal, the discrete-time
signal $x(n)$ obtained by periodic sampling is given by

$$x(n) = x_c(nT) \qquad -\infty < n < \infty \tag{2.2.19}$$


where $T$ is the sampling period. The quantity $F_s = 1/T$, the number of samples taken per
unit of time, is called the sampling rate or sampling frequency. 

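Since (2.2.19) establishes a relationship between the signals $x_c(t)$ and $x(n)$, there
should be a corresponding relation between the spectra

$$X_c(F) = \int_{-\infty}^{\infty} x_c(t)\, e^{-j2\pi F t}\, dt \tag{2.2.20}$$

and

$$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n} \tag{2.2.21}$$

of these signals.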

To establish a relationship between $X_c(F)$ and $X(e^{j\omega})$, first we need to find a relation
between the frequency variables $F$ and $\omega$. To this end, we note that periodic sampling
imposes a relationship between $t$ and $n$, namely, $t = nT = n/F_s$. Substituting $t = n/F_s$
into (2.2.20) and comparing the exponentials in (2.2.20) and (2.2.21), we see that

$$2\pi\frac{F}{F_s} = \omega = 2\pi f \qquad \text{or} \qquad f = \frac{F}{F_s} \tag{2.2.22}$$

Since $f$ is a ratio of two frequencies, it is also called a relative frequency. The term
normalized frequency is also sometimes used for the discrete-time frequency variable $f$.


It can be shown (Proakis and Manolakis 1996; Oppenheim and Schafer 1989) that the
spectra $X_c(F)$ of the continuous-time signal and $X(e^{j\omega})$ of the discrete-time signal are
related by

$$X(e^{j2\pi F/F_s}) = F_s \sum_{k=-\infty}^{\infty} X_c(F - kF_s) \tag{2.2.23}$$



The right-hand side of (2.2.23) consists of a periodic repetition of the scaled continuous-time
spectrum $F_s X_c(F)$ with period $F_s$. This periodicity is necessary because the spectrum of any
discrete-time signal has to be periodic. To see the implications of (2.2.23), let us assume that
$X_c(F)$ is band-limited, that is, $X_c(F) = 0$ for $|F| > B$, as shown in Figure 2.2. According
to (2.2.23), the spectrum $X(e^{j2\pi F/F_s})$ is the superposition of an infinite number of replications
of $X_c(F)$ at integer multiples of the sampling frequency $F_s$. Figure 2.2(b) illustrates the
situation when $F_s > 2B$, whereas Figure 2.2(c) shows what happens if $F_s < 2B$. In the latter
case, high-frequency components take on the identity of lower frequencies, a phenomenon
known as aliasing.

FIGURE 2.2
Sampling operation. (a) Continuous-time Fourier transform: Equation (2.2.20). (b) Discrete-time Fourier transform: $F_s > 2B$. (c) Discrete-time Fourier transform: $F_s < 2B$.

Obviously, aliasing can be avoided only if the sampled continuous-time signal is
band-limited and the sampling frequency $F_s$ is equal to at least twice the
bandwidth ($F_s \ge 2B$). This leads to the well-known sampling theorem, which can be stated
as follows:


SAMPLING THEOREM. A band-limited, real-valued, continuous-time signal with bandwidth $B$
can be uniquely recovered from its samples, provided that the sampling rate $F_s$ is at least equal
to twice the bandwidth, that is, provided that $F_s \ge 2B$.


If the conditions of the sampling theorem are fulfilled, that is, if $X_c(F) = 0$ for $|F| > B$
and $F_s \ge 2B$, then the signal $x_c(t)$ can be recovered from its samples $x(n) = x_c(nT)$ by
using the following interpolation formula

$$x_c(t) = \sum_{n=-\infty}^{\infty} x_c(nT)\, \frac{\sin[(\pi/T)(t - nT)]}{(\pi/T)(t - nT)} \tag{2.2.24}$$

The minimum sampling rate of $F_s = 2B$ is called the Nyquist rate. In practice, the infinite
summation in (2.2.24) has to be replaced by a finite one. Hence, only approximate
reconstruction is possible.
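
The two effects described above, aliasing and approximate reconstruction from a finite sum, can be sketched numerically. The example below assumes a test signal $x_c(t) = \cos(2\pi F_0 t)$ with illustrative values $F_0 = 1$ Hz and $F_s = 8$ Hz; the function names and the truncation length are assumptions chosen for the illustration only.

```python
import math

def sample(F0, Fs, n):
    """x(n) = x_c(nT) for x_c(t) = cos(2*pi*F0*t) with T = 1/Fs, as in (2.2.19)."""
    return math.cos(2 * math.pi * F0 * n / Fs)

def interpolate(samples, n_start, T, t):
    """Finite (truncated) version of the interpolation formula (2.2.24)."""
    total = 0.0
    for i, xn in enumerate(samples):
        arg = (math.pi / T) * (t - (n_start + i) * T)
        total += xn * (1.0 if arg == 0.0 else math.sin(arg) / arg)
    return total

Fs, F0 = 8.0, 1.0                      # F0 < Fs/2, so the sampling theorem applies
idx = range(-2000, 2001)
x = [sample(F0, Fs, n) for n in idx]

# Aliasing: a tone at F0 + Fs produces exactly the same samples as the tone at F0.
x_alias = [sample(F0 + Fs, Fs, n) for n in idx]
max_alias_gap = max(abs(u - v) for u, v in zip(x, x_alias))

# Truncated reconstruction between sampling instants (only approximate).
t = 0.3
approx = interpolate(x, -2000, 1.0 / Fs, t)
exact = math.cos(2 * math.pi * F0 * t)
print(max_alias_gap, abs(approx - exact))
```

The first printed value is at rounding-error level, since the two tones are indistinguishable after sampling; the second is small but nonzero because the sum in (2.2.24) has been truncated.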


2.2.3 The Discrete Fourier Transform 


The $N$-point discrete Fourier transform (DFT) of an $N$-point sequence $\{x(n), n = 0, 1, \ldots,
N - 1\}$ is defined by†

$$\tilde{X}(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j(2\pi/N)kn} \qquad k = 0, 1, \ldots, N - 1 \tag{2.2.25}$$

The $N$-point sequence $\{x(n), n = 0, 1, \ldots, N - 1\}$ can be recovered from its DFT coefficients
$\{\tilde{X}(k), k = 0, 1, \ldots, N - 1\}$ by the following inverse DFT formula:

$$x(n) = \frac{1}{N}\sum_{k=0}^{N-1} \tilde{X}(k)\, e^{j(2\pi/N)kn} \qquad n = 0, 1, \ldots, N - 1 \tag{2.2.26}$$

We note that by its definition, the $N$-point DFT requires or provides information only for
$N$ samples of a discrete-time signal. Hence, by itself it does not provide a frequency decomposition of
the signal, because any discrete-time signal must be specified for all discrete-time instances,
$-\infty < n < \infty$. The use of the DFT for frequency analysis depends on the signal values
outside the interval $0 \le n \le N - 1$. Depending on these values, we can obtain various
interpretations of the DFT. The value of the DFT lies exactly in these interpretations.


DFT of finite-duration signals. Let $x(n)$ be a finite-duration signal with nonzero values
over the range $0 \le n \le N - 1$ and zero values elsewhere. If we evaluate $X(e^{j\omega})$ at $N$
equidistant frequencies, say, $\omega_k = (2\pi/N)k$, $0 \le k \le N - 1$, we obtain

$$X(e^{j\omega_k}) = X(e^{j2\pi k/N}) = \sum_{n=0}^{N-1} x(n)\, e^{-j(2\pi/N)kn} = \tilde{X}(k) \tag{2.2.27}$$

which follows by comparing the last equation with (2.2.25). This implies that the $N$-point
DFT of a finite-duration signal with length $N$ is equal to the Fourier transform of the signal
at frequencies $\omega_k = (2\pi/N)k$, $0 \le k \le N - 1$. Hence, in this case, the $N$-point DFT
corresponds to the uniform sampling of the Fourier transform of a discrete-time signal at
$N$ equidistant points, that is, sampling in the frequency domain.
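
The sampling interpretation in (2.2.27) can be checked directly. The sketch below evaluates the DTFT sum of a short sequence at the frequencies $\omega_k = (2\pi/N)k$ and compares the result with the DFT definition (2.2.25); the function names `dtft` and `dft` and the test sequence are assumptions made for this illustration.

```python
import cmath

def dtft(x, w):
    """X(e^{jw}) of a finite-duration x(n), 0 <= n <= N-1, from the sum in (2.2.12)."""
    return sum(xn * cmath.exp(-1j * w * n) for n, xn in enumerate(x))

def dft(x):
    """N-point DFT computed directly from the definition (2.2.25)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

x = [1.0, 2.0, 3.0, 4.0, 0.5]          # arbitrary illustrative sequence, N = 5
N = len(x)
X_dft = dft(x)
X_samples = [dtft(x, 2 * cmath.pi * k / N) for k in range(N)]

# (2.2.27): the DFT coefficients coincide with the DTFT sampled at w_k = 2*pi*k/N.
max_gap = max(abs(u - v) for u, v in zip(X_dft, X_samples))
print(max_gap)
```

This direct evaluation costs on the order of $N^2$ operations and is meant only to make the relation concrete.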


†In many traditional textbooks, the DFT is denoted by $X(k)$. We will use the notation $\tilde{X}(k)$ to distinguish the DFT
from the DTFT $X(e^{j\omega})$ function or its samples.


DFT of periodic signals. Suppose now that $x(n)$ is a periodic sequence with fundamental
period $N$. This sequence can be decomposed into frequency components by using
the Fourier series in (2.2.8) and (2.2.9). Comparison of (2.2.26) with (2.2.8) shows that

$$\tilde{X}(k) = N X_k \qquad k = 0, 1, \ldots, N - 1 \tag{2.2.28}$$


that is, the DFT of one period of a periodic signal is given by the Fourier series coefficients 
of the signal scaled by the fundamental period. Obviously, computing the DFT of a fraction 
of a period will lead to DFT coefficients that are not related to the Fourier series coefficients 
of the periodic signal. 
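
A small numerical check of (2.2.26) and (2.2.28) is sketched below for one period of $\sin(2\pi n/4)$, whose Fourier series coefficients are known in closed form ($X_1 = -j/2$, $X_3 = j/2$, and zero otherwise). The helper names `dft` and `idft` are illustrative assumptions.

```python
import cmath

def dft(x):
    """N-point DFT from the definition (2.2.25)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT from (2.2.26)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

period = [0.0, 1.0, 0.0, -1.0]          # one period of sin(2*pi*n/4), N = 4
X_tilde = dft(period)

# (2.2.28): dividing the DFT of one period by N gives the Fourier series coefficients.
X_series = [c / len(period) for c in X_tilde]
roundtrip = idft(X_tilde)
print(X_series)
```

Up to rounding error, `X_series` contains $-j/2$ at $k = 1$ and $+j/2$ at $k = 3$, and `roundtrip` reproduces `period`.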

The DFT can be efficiently computed by using a family of fast algorithms, referred to
as fast Fourier transform (FFT) algorithms, with complexity proportional to $N \log_2 N$. Due
to the efficiency offered by these algorithms, the DFT is widely used for the computation 
of spectra, correlations, and convolutions and for the implementation of digital filters. 


2.2.4 The z-Transform 


The z-transform of a sequence is a very powerful tool for the analysis of linear and time-
invariant systems. It is defined by the following pair of equations:

$$X(z) \triangleq \mathcal{Z}[x(n)] = \sum_{n=-\infty}^{\infty} x(n)\, z^{-n} \tag{2.2.29}$$

$$x(n) = \frac{1}{2\pi j}\oint_C X(z)\, z^{n-1}\, dz \tag{2.2.30}$$

Equation (2.2.29) is known as the direct transform, whereas equation (2.2.30) is referred
to as the inverse transform. The set of values of $z$ for which the power series in (2.2.29)
converges is called the region of convergence (ROC) of $X(z)$. A sufficient condition for
convergence is

$$\sum_{n=-\infty}^{\infty} |x(n)|\,|z^{-n}| < \infty \tag{2.2.31}$$


In general, the ROC is a ring in the complex plane; that is, $R_l < |z| < R_u$. The values
of $R_l$ and $R_u$ depend on the nature of the signal $x(n)$. For finite-duration signals, $X(z)$ is a
polynomial in $z^{-1}$, and the ROC is the entire z-plane with a possible exclusion of the points
$z = 0$ and/or $z = \pm\infty$. For causal signals with infinite duration, the ROC is, in general,
$R_l < |z| < \infty$, that is, the exterior of a circle. For anticausal signals [$x(n) = 0$, $n > 0$], the
ROC is the interior of a circle, that is, $0 < |z| < R_u$. For two-sided infinite-duration signals,
the ROC is, in general, a ring $R_l < |z| < R_u$. The contour of integration in the inverse
transform in (2.2.30) can be any counterclockwise closed path that encloses the origin and
is inside the ROC.

If we compute the z-transform on the unit circle of the z-plane, that is, if we set $z = e^{j\omega}$
in (2.2.29) and (2.2.30), we obtain

$$X(z)\big|_{z=e^{j\omega}} = X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n} \tag{2.2.32}$$

$$x(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega \tag{2.2.33}$$

which are the Fourier transform and inverse Fourier transform relating the signals $x(n)$ and
$X(e^{j\omega})$. This relation holds only if the unit circle is inside the ROC.

TABLE 2.1
Properties of z-Transform.


The z-transform has many properties that are useful for the study of discrete-time 
signals and systems. Some of these properties are given in Table 2.1. Assuming that the 
involved Fourier transform exists, setting $z = e^{j\omega}$ in each of the properties of Table 2.1
gives a corresponding table of properties for the Fourier transform. 

An important family of z-transforms is those for which $X(z)$ is a rational function, that
is, a ratio of two polynomials in $z$ or $z^{-1}$. The roots of the numerator polynomial, that is,
the values of $z$ for which $X(z) = 0$, are referred to as the zeros of $X(z)$. The roots of the
denominator polynomial, that is, the values of $z$ for which $|X(z)| = \infty$, are referred to as
the poles of $X(z)$. Although zeros and poles may occur at $z = 0$ or $z = \infty$, we usually
do not count them. As will be seen throughout this book, the locations of poles and zeros
play an important role in the analysis of signals and systems. To display poles and zeros in
the z-plane, we use the symbols × and ○, respectively.

The inverse z-transform—that is, determining the signal x(n) given its z-transform 
X (z)—involves the computation of the contour integral in (2.2.30). However, most practical 
applications involve rational z-transforms that can be easily inverted using partial fraction 
expansion techniques. Finally, we note that a working familiarity with the z-transform 
technique is necessary for the complete understanding of the material in subsequent chapters. 


2.2.5 Representations of Narrowband Signals 


A signal is known as a narrowband signal if it is band-limited to a band whose width is
small compared to the band center frequency. Such a narrowband signal transform $X_c(F)$ is
shown in Figure 2.3(a), and the corresponding signal waveform $x_c(t)$ that it may represent
is shown in Figure 2.3(b). The center frequency of $x_c(t)$ is $F_0$, and its bandwidth is $B$, which
is much less than $F_0$. It is informative to note that the signal $x_c(t)$ appears to be a sinusoidal
waveform whose amplitude and phase are both varying slowly with respect to the variations
of the cosine wave. Therefore, such a signal can be represented by

$$x_c(t) = a(t)\cos[2\pi F_0 t + \theta(t)] \tag{2.2.34}$$


where $a(t)$ describes the amplitude variation (or envelope modulation) and $\theta(t)$ describes
the phase modulation of a carrier wave of frequency $F_0$ Hz. Although (2.2.34) can be
used to describe any arbitrary signal, the concepts of envelope and phase modulation are
meaningless unless $a(t)$ and $\theta(t)$ vary slowly in comparison to $\cos 2\pi F_0 t$, or equivalently,
unless $B \ll F_0$.

Property                  Time domain                  z-Domain                                                        ROC
Notation                  $x(n)$                       $X(z)$                                                          ROC: $R_l < |z| < R_u$
                          $x_1(n)$                     $X_1(z)$                                                        ROC$_1$: $R_{1l} < |z| < R_{1u}$
                          $x_2(n)$                     $X_2(z)$                                                        ROC$_2$: $R_{2l} < |z| < R_{2u}$
Linearity                 $a_1 x_1(n) + a_2 x_2(n)$    $a_1 X_1(z) + a_2 X_2(z)$                                       ROC$_1 \cap$ ROC$_2$
Time shifting             $x(n - k)$                   $z^{-k} X(z)$                                                   $R_l < |z| < R_u$, except $z = 0$ if $k > 0$
Scaling in the z-domain   $a^n x(n)$                   $X(a^{-1} z)$                                                   $|a| R_l < |z| < |a| R_u$
Time reversal             $x(-n)$                      $X(z^{-1})$                                                     $1/R_u < |z| < 1/R_l$
Conjugation               $x^*(n)$                     $X^*(z^*)$                                                      ROC
Differentiation           $n x(n)$                     $-z\, \dfrac{dX(z)}{dz}$                                        ROC
Convolution               $x_1(n) * x_2(n)$            $X_1(z) X_2(z)$                                                 ROC$_1 \cap$ ROC$_2$
Multiplication            $x_1(n) x_2(n)$              $\dfrac{1}{2\pi j}\oint_C X_1(v) X_2(z/v)\, v^{-1}\, dv$         $R_{1l} R_{2l} < |z| < R_{1u} R_{2u}$
Parseval’s relation       $\sum_{n=-\infty}^{\infty} x_1(n) x_2^*(n) = \dfrac{1}{2\pi j}\oint_C X_1(v) X_2^*(1/v^*)\, v^{-1}\, dv$

FIGURE 2.3
Narrowband signal: (a) Fourier transform and (b) waveform.

In the literature, two approaches are commonly used to describe a narrowband signal. In
the first approach, the signal is represented by using a complex envelope, while in the second
approach the quadrature component representation is used. By using Euler’s identity, it is
easy to verify that (2.2.34) can be put in the form

$$x_c(t) = \mathrm{Re}\left[a(t)\, e^{j[2\pi F_0 t + \theta(t)]}\right] = \mathrm{Re}\left[a(t)\, e^{j\theta(t)}\, e^{j2\pi F_0 t}\right] \tag{2.2.35}$$

Let

$$\tilde{x}_c(t) \triangleq a(t)\, e^{j\theta(t)} \tag{2.2.36}$$

Then from (2.2.35) we obtain

$$x_c(t) = \mathrm{Re}\left[\tilde{x}_c(t)\, e^{j2\pi F_0 t}\right] \tag{2.2.37}$$


The complex-valued signal $\tilde{x}_c(t)$ contains both the amplitude and phase variations of $x_c(t)$,
and hence it is referred to as the complex envelope of the narrowband signal $x_c(t)$. Similarly,
again starting with (2.2.34) and this time using the trigonometric identity for the cosine of a sum, we can write

$$x_c(t) = a(t)\cos 2\pi F_0 t\, \cos\theta(t) - a(t)\sin 2\pi F_0 t\, \sin\theta(t) \tag{2.2.38}$$

Let

$$x_{cI}(t) \triangleq a(t)\cos\theta(t) \tag{2.2.39}$$

$$x_{cQ}(t) \triangleq a(t)\sin\theta(t) \tag{2.2.40}$$


which are termed the in-phase and the quadrature components of the narrowband signal $x_c(t)$,
respectively. Then (2.2.38) can be written as

$$x_c(t) = x_{cI}(t)\cos 2\pi F_0 t - x_{cQ}(t)\sin 2\pi F_0 t \tag{2.2.41}$$

Clearly, the above two representations are related. If we expand (2.2.36), then we obtain

$$\tilde{x}_c(t) = x_{cI}(t) + j x_{cQ}(t) \tag{2.2.42}$$


which implies that the in-phase and quadrature components are, respectively, the real and
imaginary parts of the complex envelope $\tilde{x}_c(t)$. These representations will be used extensively
in Chapter 11.
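
To make the equivalence of the three representations concrete, the following sketch evaluates (2.2.34), (2.2.37), and (2.2.41) for the same signal and compares them. The carrier frequency and the particular slowly varying $a(t)$ and $\theta(t)$ are assumed values chosen only for illustration.

```python
import cmath
import math

F0 = 1000.0                                   # assumed carrier frequency in Hz

def a(t):
    """Assumed slowly varying envelope."""
    return 1.0 + 0.5 * math.cos(2 * math.pi * 10 * t)

def theta(t):
    """Assumed slowly varying phase."""
    return 0.3 * math.sin(2 * math.pi * 5 * t)

def envelope(t):
    """Complex envelope a(t) e^{j theta(t)}, as in (2.2.36)."""
    return a(t) * cmath.exp(1j * theta(t))

def x_amp_phase(t):
    """Amplitude-phase form (2.2.34)."""
    return a(t) * math.cos(2 * math.pi * F0 * t + theta(t))

def x_from_envelope(t):
    """Complex-envelope form (2.2.37)."""
    return (envelope(t) * cmath.exp(2j * math.pi * F0 * t)).real

def x_from_quadrature(t):
    """Quadrature form (2.2.41), with I and Q taken from (2.2.42)."""
    env = envelope(t)
    return (env.real * math.cos(2 * math.pi * F0 * t)
            - env.imag * math.sin(2 * math.pi * F0 * t))

times = [k * 1e-5 for k in range(200)]
max_gap = max(max(abs(x_amp_phase(t) - x_from_envelope(t)),
                  abs(x_amp_phase(t) - x_from_quadrature(t))) for t in times)
print(max_gap)
```

The printed gap is at rounding-error level, as expected from the identities used to derive (2.2.37) and (2.2.41).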


Bandpass sampling theorem. One application of the complex-envelope representation
lies in the optimum sampling of narrowband signals. In a general sense, the narrowband
signal $x_c(t)$ is also a bandpass signal that is approximately band-limited to $(F_0 + B/2)$ Hz.
According to the sampling theorem in Section 2.2.2, the Nyquist sampling rate for $x_c(t)$ is
then

$$F_s = 2\left(F_0 + \frac{B}{2}\right) \approx 2F_0 \qquad \text{for } B \ll F_0$$


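However, since the effective bandwidth of $x_c(t)$ is $B/2$ Hz, the optimum rate should be $B$,
which is much smaller than $2F_0$. To obtain this optimum rate, consider (2.2.34), which we
can write as

$$\begin{aligned}
x_c(t) &= a(t)\cos[2\pi F_0 t + \theta(t)] = a(t)\,\frac{e^{j[2\pi F_0 t + \theta(t)]} + e^{-j[2\pi F_0 t + \theta(t)]}}{2} \\
&= \frac{a(t)e^{j\theta(t)}}{2}\, e^{j2\pi F_0 t} + \frac{a(t)e^{-j\theta(t)}}{2}\, e^{-j2\pi F_0 t} \\
&= \tfrac{1}{2}\tilde{x}_c(t)\, e^{j2\pi F_0 t} + \tfrac{1}{2}\tilde{x}_c^*(t)\, e^{-j2\pi F_0 t}
\end{aligned} \tag{2.2.43}$$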


Using the transform properties from Table 2.1, we see that the Fourier transform of $x_c(t)$ is
given by

$$X_c(F) = \tfrac{1}{2}\left[\tilde{X}_c(F - F_0) + \tilde{X}_c^*(-F - F_0)\right] \tag{2.2.44}$$

The first term in (2.2.44) is the Fourier transform of $\tilde{x}_c(t)$ shifted by $F_0$, and hence it must
be the positive band-limited portion of $X_c(F)$. Similarly, the second term in (2.2.44) is the
Fourier transform of $\tilde{x}_c^*(t)$ shifted by $-F_0$ (or shifted left by $F_0$). Now the Fourier transform
of $\tilde{x}_c^*(t)$ is $\tilde{X}_c^*(-F)$, and hence the second term must be the negative band-limited portion
of $X_c(F)$.

We thus conclude that $\tilde{x}_c(t)$ is a baseband complex-valued signal limited to the band
of width $B$, as shown in Figure 2.4. Furthermore, note that the sampling theorem of Section
2.2.2 is applicable to real- as well as complex-valued signals. Therefore, we can sample
the complex envelope $\tilde{x}_c(t)$ at the Nyquist rate of $B$ samples per second; and,
by extension, we can sample the narrowband signal $x_c(t)$ at the same rate without aliasing.
From (2.2.24), the sampling representation of $\tilde{x}_c(t)$ is given by

$$\tilde{x}_c(t) = \sum_{n=-\infty}^{\infty} \tilde{x}_c\!\left(\frac{n}{B}\right) \frac{\sin[\pi B(t - n/B)]}{\pi B(t - n/B)} \tag{2.2.45}$$



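Substituting (2.2.45) and (2.2.36) in (2.2.37), we obtain

$$\begin{aligned}
x_c(t) &= \mathrm{Re}\left[\sum_{n=-\infty}^{\infty} \tilde{x}_c\!\left(\frac{n}{B}\right) \frac{\sin[\pi B(t - n/B)]}{\pi B(t - n/B)}\, e^{j2\pi F_0 t}\right] \\
&= \mathrm{Re}\left[\sum_{n=-\infty}^{\infty} a\!\left(\frac{n}{B}\right) e^{j\theta(n/B)}\, e^{j2\pi F_0 t}\, \frac{\sin[\pi B(t - n/B)]}{\pi B(t - n/B)}\right] \\
&= \sum_{n=-\infty}^{\infty} a\!\left(\frac{n}{B}\right) \cos\!\left[2\pi F_0 t + \theta\!\left(\frac{n}{B}\right)\right] \frac{\sin[\pi B(t - n/B)]}{\pi B(t - n/B)}
\end{aligned} \tag{2.2.46}$$

which is the amplitude-phase form of the bandpass sampling theorem. Using the trigonometric
identity for the cosine of a sum, the quadrature-component form of the theorem is given by

$$x_c(t) = \sum_{n=-\infty}^{\infty} \left[x_{cI}\!\left(\frac{n}{B}\right)\cos 2\pi F_0 t - x_{cQ}\!\left(\frac{n}{B}\right)\sin 2\pi F_0 t\right] \frac{\sin[\pi B(t - n/B)]}{\pi B(t - n/B)} \tag{2.2.47}$$

FIGURE 2.4
Fourier transform of a complex envelope $\tilde{x}_c(t)$ (frequency axis $F$, band from $-B/2$ to $B/2$).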


Applications of this theorem are considered in Chapter 11. 


2.3 DISCRETE-TIME SYSTEMS 


In this section, we review the basics of linear, time-invariant systems by emphasizing those 
aspects of particular importance to this book. For our purposes, a system is defined to be 
any physical device or algorithm that transforms a signal, called the input or excitation, 
into another signal, called the output or response. When the system is simply an algorithm, 
it may be realized in either hardware or software. Although a system can be specified from 
its parts and their functions, it will often turn out to be more convenient to characterize a 
system in terms of its response to specific signals. The mathematical relationships between 
the input and output signals of a system will be referred to as a (system) model. In the case 
of a discrete-time system, the model is simply a transformation that uniquely maps the input
signal $x(n)$ to an output signal $y(n)$. This is denoted by

$$y(n) = H[x(n)] \qquad -\infty < n < \infty \tag{2.3.1}$$

and is graphically depicted as in Figure 2.5.

FIGURE 2.5
Block diagram representation of a discrete-time system: input $x(n)$, operator $H[\cdot]$, output $y(n)$.


2.3.1 Analysis of Linear, Time-Invariant Systems 


The systems we shall deal with in this book are linear and time-invariant and are always 
assumed to be initially at rest. No initial conditions or other information will affect the 
output signal. 


Time-domain analysis. The output of a linear, time-invariant system can always be
expressed as the convolution summation between the input sequence $x(n)$ and the impulse
response or unit sample response sequence $h(n) \triangleq H[\delta(n)]$ of the system, that is,

$$y(n) = x(n) * h(n) \triangleq \sum_{k=-\infty}^{\infty} x(k)\, h(n - k) \tag{2.3.2}$$

where $*$ denotes the convolution operation. It can easily be shown that an equivalent
expression is

$$y(n) = \sum_{k=-\infty}^{\infty} h(k)\, x(n - k) = h(n) * x(n) \tag{2.3.3}$$

Thus, given the input $x(n)$ to a linear, time-invariant system, the output $y(n)$ can be computed
by using the impulse response $h(n)$ of the system and either formula (2.3.2) or (2.3.3).

If $x(n)$ and $h(n)$ are arbitrary sequences of finite duration, then the above convolution
can also be computed by using a matrix-vector multiplication operation. Let $x(n)$, $0 \le
n \le N - 1$, and $h(n)$, $0 \le n \le M - 1$, be two finite-duration sequences of lengths $N$
and $M$ ($< N$), respectively.† Then from (2.3.3), the sequence $y(n)$ is also a finite-duration
sequence over $0 \le n \le L - 1$ with $L \triangleq N + M - 1$ samples. If the samples of $y(n)$ and
$h(n)$ are arranged in the column vectors $\mathbf{y}$ and $\mathbf{h}$, respectively, then from (2.3.3) we obtain


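$$\begin{bmatrix} y(0) \\ \vdots \\ y(M-1) \\ \vdots \\ y(N-1) \\ \vdots \\ y(L-1) \end{bmatrix}
= \begin{bmatrix}
x(0) & 0 & \cdots & 0 \\
\vdots & \ddots & \ddots & \vdots \\
x(M-1) & \cdots & \cdots & x(0) \\
\vdots & \ddots & \ddots & \vdots \\
x(N-1) & \cdots & \cdots & x(N-M) \\
0 & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & x(N-1)
\end{bmatrix}
\begin{bmatrix} h(0) \\ h(1) \\ \vdots \\ h(M-1) \end{bmatrix} \tag{2.3.4}$$

or

$$\mathbf{y} = \mathbf{X}\mathbf{h} \tag{2.3.5}$$

where the $L \times M$ matrix $\mathbf{X}$ contains linear shifts in $x(n - k)$ for $n = 0, \ldots, N - 1$, which are
arranged as rows. The matrix $\mathbf{X}$ is termed an input data matrix. It has the interesting property
that all the elements along any diagonal are equal. Such a matrix is called a Toeplitz matrix,
and thus $\mathbf{X}$ has a Toeplitz structure. Note that the first and the last $M - 1$ rows of $\mathbf{X}$ contain
zero (or boundary) values. Therefore, the first and the last $M - 1$ samples of $y(n)$ contain
transient boundary effects. In passing, we note that the vector $\mathbf{y}$ can also be obtained as

$$\mathbf{y} = \mathbf{H}\mathbf{x} \tag{2.3.6}$$

in which $\mathbf{H}$ is a Toeplitz matrix obtained from (2.3.2). However, we will emphasize the
approach given in (2.3.5) in subsequent chapters.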

MATLAB provides a built-in function called conv that computes the convolution of two
finite-duration sequences and is invoked by y = conv(h,x). Alternatively, the convolution
can also be implemented using (2.3.4) in which the Toeplitz data matrix X is obtained using
the function toeplitz (see Problem 2.4).
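
For readers without MATLAB, the matrix form (2.3.4) can be sketched in plain Python. The helpers below (`conv`, `data_matrix`, `matvec`) are hypothetical names written for this illustration; they are not the MATLAB functions mentioned above, and the sequences are assumed to start at $n = 0$.

```python
def conv(h, x):
    """Direct evaluation of the convolution sum (2.3.3) for sequences starting at n = 0."""
    L = len(x) + len(h) - 1
    return [sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))
            for n in range(L)]

def data_matrix(x, M):
    """L x M input data matrix X of (2.3.4): entry (n, k) is x(n - k), zero outside 0..N-1."""
    N = len(x)
    L = N + M - 1
    return [[x[n - k] if 0 <= n - k < N else 0.0 for k in range(M)] for n in range(L)]

def matvec(A, v):
    """Matrix-vector product y = A v."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]

x = [1.0, 2.0, 3.0, 4.0]        # N = 4 (illustrative values)
h = [1.0, -1.0, 0.5]            # M = 3 (illustrative values)

X = data_matrix(x, len(h))
y_matrix = matvec(X, h)         # y = X h, as in (2.3.5)
y_direct = conv(h, x)
print(y_matrix)
```

Both computations return the same $L = N + M - 1 = 6$ samples, and every diagonal of `X` is constant, which is the Toeplitz property noted above.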

A system is called causal if the present value of the output signal depends only on 
the present and/or past values of the input signal. Although causality is necessary for the 
real-time implementation of discrete-time systems, it is not really a problem in off-line 
applications where the input signal has already been recorded. A necessary and sufficient 
condition for a linear, time-invariant system to be causal is that the impulse response satisfies
$h(n) = 0$ for $n < 0$.

Stability is another important system property. There are various types of stability
criteria. A system is called bounded-input bounded-output (BIBO) stable or simply stable
if and only if every bounded input, namely, $|x(n)| \le M_x < \infty$ for all $n$, produces a bounded
output, that is, $|y(n)| \le M_y < \infty$ for all $n$. Clearly, unstable systems generate unbounded
output signals and, hence, are not useful in practical applications because they will result in
an overflow in the output. It can be shown that an LTI system is BIBO stable if and only if

$$\sum_{n=-\infty}^{\infty} |h(n)| < \infty \tag{2.3.7}$$


Transform-domain analysis. In addition to the time-domain convolution approach,
the output of a linear, time-invariant system can be determined by using transform techniques.
Indeed, by using the convolution property of the z-transform (see Table 2.1), (2.3.2)
yields

$$Y(z) = H(z)X(z) \tag{2.3.8}$$

where $X(z)$, $Y(z)$, and $H(z)$ are the z-transforms of the input, output, and impulse response
sequences, respectively. The z-transform $H(z) = \mathcal{Z}[h(n)]$ of the impulse response is called
the system function and plays a very important role in the analysis and characterization of
linear, time-invariant systems. If the unit circle is inside the ROC of $H(z)$, the system is
stable and $H(e^{j\omega})$ provides its frequency response.

†For the purpose of this illustration, we assume that the sequences begin at $n = 0$, but they may have any arbitrary
finite duration.

Evaluating (2.3.8) on the unit circle gives

$$Y(e^{j\omega}) = H(e^{j\omega})X(e^{j\omega}) \tag{2.3.9}$$

where $H(e^{j\omega})$ is the frequency response function of the system. Since, in general, $H(e^{j\omega})$
is complex-valued, we have

$$H(e^{j\omega}) = |H(e^{j\omega})|\, e^{j\angle H(e^{j\omega})} \tag{2.3.10}$$

where $|H(e^{j\omega})|$ is the magnitude response and $\angle H(e^{j\omega})$ is the phase response of the system.
For a system with a real impulse response, $|H(e^{j\omega})|$ has even symmetry and $\angle H(e^{j\omega})$ has
odd symmetry. The group delay response of a system with frequency response $H(e^{j\omega})$ is
defined as

$$\tau(e^{j\omega}) = -\frac{d}{d\omega}\angle H(e^{j\omega}) \tag{2.3.11}$$

and provides a measure of the average delay of the system as a function of frequency.
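
As an illustration of (2.3.10) and (2.3.11), consider the assumed FIR impulse response $h = \{1, 2, 1\}$, for which $H(e^{j\omega}) = e^{-j\omega}(2 + 2\cos\omega)$, so the phase is $-\omega$ and the group delay is one sample for $|\omega| < \pi$. The sketch below estimates these quantities numerically; the central-difference step and the function names are assumptions of this example, and the derivative estimate is valid only where the phase does not wrap.

```python
import cmath
import math

def freq_response(h, w):
    """H(e^{jw}) for a finite impulse response h(n) starting at n = 0."""
    return sum(hn * cmath.exp(-1j * w * n) for n, hn in enumerate(h))

def group_delay(h, w, dw=1e-6):
    """Central-difference estimate of -d(angle H)/dw, assuming no phase wrap near w."""
    p_hi = cmath.phase(freq_response(h, w + dw))
    p_lo = cmath.phase(freq_response(h, w - dw))
    return -(p_hi - p_lo) / (2 * dw)

h = [1.0, 2.0, 1.0]                 # assumed example with linear phase
w = 1.0                             # radians per sample

H = freq_response(h, w)
mag, phase = abs(H), cmath.phase(H)
mag_neg = abs(freq_response(h, -w))            # even symmetry of the magnitude
phase_neg = cmath.phase(freq_response(h, -w))  # odd symmetry of the phase
tau = group_delay(h, w)
print(mag, phase, tau)
```

Here `mag` equals $2 + 2\cos(1)$, `phase` equals $-1$ rad, and `tau` is close to 1 sample.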


Systems described by linear, constant-coefficient difference equations. A discrete- 
time system is called practically realizable if it satisfies the following conditions: (1) It 
requires a finite amount of memory, and (2) the amount of arithmetic operations required 
for the computation of each output sample is finite. Clearly, any system that does not satisfy 
either of these conditions cannot be implemented in practice. 

If, in addition to being linear and time-invariant, we require a system to be causal and
practically realizable, then the most general input/output description of such a system takes
the form of a constant-coefficient, linear difference equation

$$y(n) = -\sum_{k=1}^{P} a_k\, y(n - k) + \sum_{k=0}^{Q} d_k\, x(n - k) \tag{2.3.12}$$

If the system parameters $\{a_k, d_k\}$ depend on time, the system is linear and time-varying.
If, however, the system parameters depend on either the input or output signals, then the
system becomes nonlinear.

By limiting our attention to constant parameters and evaluating the z-transform of both
sides of (2.3.12), we obtain

$$H(z) = \frac{Y(z)}{X(z)} = \frac{\sum_{k=0}^{Q} d_k z^{-k}}{1 + \sum_{k=1}^{P} a_k z^{-k}} \triangleq \frac{D(z)}{A(z)} \tag{2.3.13}$$

Clearly, a system with a rational system function can be described, within a gain factor, by
the locations of its poles and zeros in the complex z-plane

$$H(z) = \frac{D(z)}{A(z)} = d_0\, \frac{\prod_{k=1}^{Q} (1 - z_k z^{-1})}{\prod_{k=1}^{P} (1 - p_k z^{-1})} \tag{2.3.14}$$

The system described by (2.3.12) or equivalently by (2.3.13) or (2.3.14) is stable if its poles,
that is, the roots of the denominator polynomial $A(z)$, are all inside the unit circle.

The difference equation in (2.3.12) is implemented in MATLAB using the filter
function. In its simplest form, this function is invoked by y = filter(d,a,x) where
d = [d0,d1,...,dQ] and a = [1,a1,...,aP] are the numerator and denominator
coefficient arrays in (2.3.13), respectively.
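
The recursion in (2.3.12) is short enough to write out directly. The following is a minimal sketch, assuming zero initial conditions and a leading denominator coefficient of 1; the function name `difference_equation` is hypothetical and the code is not a substitute for the MATLAB filter function, only an illustration of the same recursion.

```python
def difference_equation(d, a, x):
    """Evaluate (2.3.12) sample by sample with the system initially at rest.

    d = [d0, ..., dQ] and a = [1, a1, ..., aP]; a[0] is assumed to equal 1.
    """
    y = []
    for n in range(len(x)):
        acc = sum(d[k] * x[n - k] for k in range(len(d)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y

# One pole at z = 0.5: the impulse response should be h(n) = 0.5**n for n >= 0.
impulse = [1.0] + [0.0] * 5
h_pole = difference_equation([1.0], [1.0, -0.5], impulse)

# With a = [1] the recursion reduces to an FIR filter.
y_fir = difference_equation([1.0, -1.0, 0.5], [1.0], [1.0, 2.0, 3.0, 4.0])
print(h_pole, y_fir)
```

The first result decays geometrically, as expected for a single pole inside the unit circle; the second contains only as many output samples as there are input samples.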

If the coefficients $a_k$ in (2.3.12) are zero, we have

$$y(n) = \sum_{k=0}^{Q} d_k\, x(n - k) \tag{2.3.15}$$

which compared to (2.3.3) yields

$$h(n) = \begin{cases} d_n & 0 \le n \le Q \\ 0 & \text{elsewhere} \end{cases} \tag{2.3.16}$$


that is, the system in (2.3.15) has an impulse response with finite duration and is called a 
finite impulse response (FIR) system. From (2.3.13), it follows that the system function of 
an FIR system is a polynomial in $z^{-1}$, and thus $H(z)$ has $Q$ trivial poles at $z = 0$ and $Q$
zeros. For this reason, FIR systems are also referred to as all-zero (AZ) systems. Figure 2.6 
shows a straightforward block diagram realization of the FIR system (2.3.15) in terms of 
unit delays, adders, and multipliers. 


FIGURE 2.6
FIR filter realization (direct form).


In MATLAB, FIR filters are represented either by the values of the impulse response $h(n)$
or by the difference equation coefficients $d_k$. Therefore, for computational purposes, we
can use either the y = conv(h,x) function or the y = filter(d,[1],x) function. The outputs
of these two implementations differ in a way that should be noted. The conv function
produces all values of $y(n)$ in (2.3.4), while the output sequence from the filter function
provides only $y(0), \ldots, y(N-1)$. This can be seen by referring to matrix $\mathbf{X}$ in (2.3.4):
for the filter function, the input data matrix $\mathbf{X}$ contains only the first $N$ rows; that is, the
output of the filter function contains transient effects from the boundary at $n = 0$. For signal
processing applications, the use of the filter function is strongly encouraged.
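
The difference in output length can be seen in a small Python sketch. The two helper functions below are hypothetical stand-ins written for this illustration (they are not the MATLAB conv and filter routines), and $N$ is assumed to be the number of input samples.

```python
def conv(h, x):
    """Full linear convolution; returns len(h) + len(x) - 1 samples."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y


def fir_filter(d, x):
    """FIR filtering that keeps only y(0), ..., y(N-1), N = len(x)."""
    return [sum(d[k] * x[n - k] for k in range(len(d)) if n - k >= 0)
            for n in range(len(x))]


h = [1.0, 2.0, 3.0]
x = [1.0, 1.0, 1.0, 1.0]
y_conv = conv(h, x)        # includes the trailing transient
y_filt = fir_filter(h, x)  # stops at the last input sample
```

Here y_filt coincides with the first len(x) samples of y_conv; the remaining samples of y_conv are the tail produced after the input has ended.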

When a system has both poles and zeros, $H(z)$ can be expressed in partial fraction
expansion form as

$$H(z) = \sum_{k=1}^{P} \frac{A_k}{1 - p_k z^{-1}} \tag{2.3.17}$$

if the poles are distinct and $Q < P$. The corresponding impulse response is then given by

$$h(n) = \sum_{k=1}^{P} A_k (p_k)^n u(n) \tag{2.3.18}$$
that is, each pole contributes an exponential mode of infinite duration to the impulse
response. We conclude that the presence of any nontrivial pole in a system implies an
infinite-duration impulse response. We refer to such systems as infinite impulse response (IIR)
systems. If $Q = 0$, the system has only poles, with zeros at $z = 0$, and is called an all-pole
(AP) system. It should be stressed that although all-pole and pole-zero systems are IIR,
not all IIR systems are pole-zero (PZ) systems. Indeed, there are many useful systems, for
example, an ideal low-pass filter, that cannot be described by rational system functions of
finite order. Figures 2.7 and 2.8 show direct-form realizations of an all-pole and a pole-zero
system.
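
The modal description of the impulse response can be checked numerically. The Python sketch below uses an illustrative two-pole system chosen for this purpose (not taken from the text), $H(z) = 1/[(1 - 0.5z^{-1})(1 - 0.25z^{-1})]$, whose partial fraction expansion has $A_1 = 2$, $A_2 = -1$; it compares the sum of exponential modes with the response obtained by running the recursion.

```python
def impulse_from_modes(residues, poles, n_samples):
    """h(n) = sum_k A_k p_k^n for n >= 0 (distinct poles, Q < P)."""
    return [sum(A * p ** n for A, p in zip(residues, poles))
            for n in range(n_samples)]


def impulse_from_recursion(a, n_samples):
    """Unit-sample response of the all-pole system 1/A(z), with a[0] = 1."""
    h = []
    for n in range(n_samples):
        acc = 1.0 if n == 0 else 0.0
        acc -= sum(a[k] * h[n - k] for k in range(1, len(a)) if n - k >= 0)
        h.append(acc)
    return h


# (1 - 0.5 z^-1)(1 - 0.25 z^-1) = 1 - 0.75 z^-1 + 0.125 z^-2
h_modes = impulse_from_modes([2.0, -1.0], [0.5, 0.25], 8)
h_rec = impulse_from_recursion([1.0, -0.75, 0.125], 8)
max_diff = max(abs(u - v) for u, v in zip(h_modes, h_rec))
```

Both lists agree to rounding error, and neither mode ever becomes exactly zero, which is the infinite-duration behavior described above.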


FIGURE 2.7
All-pole system realization (direct form).


FIGURE 2.8
Pole-zero system realization (direct form).


2.3.2 Response to Periodic Inputs 


Although the convolution summation formula can be used to compute the response of 
a stable system to any input signal, (2.3.8) cannot be used with periodic inputs because 
periodic signals do not possess a z-transform. However, a frequency domain formula similar 
to (2.3.9) can be developed for periodic inputs. 

Let x(n) be a periodic signal with fundamental period N. This signal can be expanded 
in a Fourier series as 


$$x(n) = \sum_{k=0}^{N-1} X_k e^{j2\pi kn/N} \qquad n = 0, 1, \ldots, N-1 \tag{2.3.19}$$

where $X_k$ are the Fourier series coefficients. Substituting (2.3.19) into (2.3.3) gives

$$y(n) = \sum_{k=0}^{N-1} X_k H(e^{j2\pi k/N})\, e^{j2\pi kn/N} \tag{2.3.20}$$

where $H(e^{j2\pi k/N})$ are samples of $H(e^{j\omega})$. But (2.3.20) is just the Fourier series expansion
of $y(n)$, hence

$$Y_k = H(e^{j2\pi k/N}) X_k \qquad k = 0, 1, \ldots, N-1 \tag{2.3.21}$$


Thus, the response of a linear, time-invariant system to a periodic input is also periodic with 
the same period. Figure 2.9 illustrates, in the frequency domain, the effect of an LTI system 
on the spectrum of aperiodic and periodic input signals. 


FIGURE 2.9
LTI system operation in the frequency domain.


EXAMPLE 2.3.1. Consider the system

$$y(n) = a y(n-1) + x(n) \qquad 0 < a < 1$$

If we restrict the inputs of the system to be only periodic signals with fundamental period $N$,
determine the impulse response of an equivalent FIR system that will provide an identical output
to the system described above.


Solution. The system output can be described by (2.3.21), where

$$H(z) = \frac{Y(z)}{X(z)} = \frac{1}{1 - a z^{-1}} = \mathcal{Z}\{a^n u(n)\}$$

From Figure 2.9, it is clearly seen that every system whose frequency response is identical to
$H(e^{j\omega})$ at the sampling points $\omega_k = (2\pi/N)k$, $0 \le k \le N-1$, provides the same output when
excited by a periodic signal having fundamental period $N$. An FIR system having this property
can be obtained by taking the inverse $N$-point DFT of $\tilde{H}(k) \triangleq H(e^{j\omega_k})$, $0 \le k \le N-1$. The resulting
impulse response $\tilde{h}(n)$ is simply the $N$-point periodic extension of $h(n) = a^n u(n)$, that is,

$$\tilde{h}(n) = \sum_{l=-\infty}^{\infty} h(n + lN) = \sum_{l=0}^{\infty} a^{n+lN} = \frac{a^n}{1 - a^N} \qquad 0 \le n \le N-1 \tag{2.3.22}$$

since $h(n + lN)$ for $l < 0$ does not contribute to the sum for $0 \le n \le N-1$.


The example above looked simple enough. Unfortunately, for somewhat more compli- 
cated all-pole filters, it becomes very difficult to evaluate the infinite summation in (2.3.22) 
in closed form, even if h(n) is available, which is often not the case. 
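
The result of Example 2.3.1 can be verified numerically. The Python sketch below uses illustrative values $a = 0.8$ and $N = 8$ (assumptions made only for this check) and assumes that the closed form in (2.3.22) is $\tilde{h}(n) = a^n/(1 - a^N)$. It confirms that the $N$-point DFT of $\tilde{h}(n)$ reproduces the samples of $H(e^{j\omega}) = 1/(1 - ae^{-j\omega})$ at $\omega_k = 2\pi k/N$.

```python
import cmath

a, N = 0.8, 8  # illustrative values, not from the text

# Equivalent FIR impulse response: a^n / (1 - a^N), 0 <= n <= N-1
h_tilde = [a ** n / (1 - a ** N) for n in range(N)]


def dft(x):
    """N-point DFT computed directly from its definition."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]


# Samples of the IIR frequency response at w_k = 2 pi k / N
H_samples = [1 / (1 - a * cmath.exp(-2j * cmath.pi * k / N)) for k in range(N)]
H_fir = dft(h_tilde)

max_err = max(abs(u - v) for u, v in zip(H_fir, H_samples))
```

The agreement holds only at the $N$ sampling frequencies; between them the FIR and IIR frequency responses differ, which does not matter for inputs of period $N$.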


2.3.3 Correlation Analysis and Spectral Density 


The investigation of system responses to specific input signals requires either the explicit
computation of the output signal or measurements that relate characteristic properties of the
output signal to corresponding characteristics of the system and the input signal. A fundamental
tool needed for such analysis is the correlation between two signals, which provides a
quantitative measure of their similarity. The correlation sequence between
two discrete-time signals $x(n)$ and $y(n)$ is defined by

$$r_{xy}(l) = \begin{cases} \displaystyle\sum_{n=-\infty}^{\infty} x(n) y^*(n-l) & \text{energy signals} \\[2ex] \displaystyle\lim_{N\to\infty} \frac{1}{2N+1} \sum_{n=-N}^{N} x(n) y^*(n-l) & \text{power signals} \end{cases} \tag{2.3.23}$$

where $l$ is termed the lag (or shift) variable. The autocorrelation sequence of a signal is
obtained by assuming that $y(n) = x(n)$, that is, by correlating a signal with itself. Thus

$$r_{xx}(l) = \begin{cases} \displaystyle\sum_{n=-\infty}^{\infty} x(n) x^*(n-l) & \text{energy signal} \\[2ex] \displaystyle\lim_{N\to\infty} \frac{1}{2N+1} \sum_{n=-N}^{N} x(n) x^*(n-l) & \text{power signal} \end{cases} \tag{2.3.24}$$

In this case, we use the simplified notation $r_x(l)$ or even $r(l)$ if there is no possibility of
confusion.
The autocorrelation sequence $r_x(l)$ and the energy spectrum of a signal $x(n)$ form a
Fourier transform pair

$$r_x(l) \longleftrightarrow R_x(e^{j\omega}) \tag{2.3.25}$$

Since $R_x(e^{j\omega}) = |X(e^{j\omega})|^2$, the Wiener-Khintchine theorem (2.3.25) is usually used to
define the energy spectral density function $R_x(e^{j\omega})$. Clearly, $r_x(l)$ and $R_x(e^{j\omega})$ do not
contain any phase information.
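
For a finite-length (energy) signal this transform pair is easy to check numerically. The Python sketch below uses an arbitrary three-sample complex sequence and a single test frequency, both chosen only for illustration, and compares the Fourier transform of the autocorrelation sequence with $|X(e^{j\omega})|^2$.

```python
import cmath


def autocorr(x, lag):
    """r_x(l) = sum_n x(n) x*(n - l) for a finite-length signal."""
    n_pts = len(x)
    return sum(x[n] * x[n - lag].conjugate()
               for n in range(n_pts) if 0 <= n - lag < n_pts)


def dtft(seq, first_index, w):
    """DTFT at frequency w of a finite sequence starting at first_index."""
    return sum(v * cmath.exp(-1j * w * (first_index + i))
               for i, v in enumerate(seq))


x = [1 + 0j, 2 - 1j, -1 + 0.5j]          # illustrative signal
max_lag = len(x) - 1
r = [autocorr(x, l) for l in range(-max_lag, max_lag + 1)]

w = 0.7                                   # arbitrary test frequency (radians)
R_from_r = dtft(r, -max_lag, w)           # transform of r_x(l)
R_from_x = abs(dtft(x, 0, w)) ** 2        # |X(e^{jw})|^2
```

The two values agree to rounding error, and the imaginary part of R_from_r is negligible because $r_x(l)$ is conjugate symmetric.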

In many instances, we need to evaluate the cross-correlation between the input and
output signals and the autocorrelation of the output signals. It can be easily shown that

$$r_{yx}(l) = h(l) * r_x(l) \tag{2.3.26}$$

$$r_y(l) = h^*(-l) * r_{yx}(l) = r_h(l) * r_x(l) \tag{2.3.27}$$

where

$$r_h(l) \triangleq \sum_{n=-\infty}^{\infty} h(n) h^*(n-l) = h(l) * h^*(-l) \tag{2.3.28}$$

is the autocorrelation of the impulse response. Taking the z-transform of both sides in the
above equations, we obtain

$$R_{yx}(z) = H(z) R_x(z) \tag{2.3.29}$$

$$R_y(z) = H^*\!\left(\frac{1}{z^*}\right) R_{yx}(z) = R_h(z) R_x(z) \tag{2.3.30}$$

and

$$R_h(z) = H(z) H^*\!\left(\frac{1}{z^*}\right) \tag{2.3.31}$$

where $R_x(z)$, $R_y(z)$, and $R_h(z)$ are known as complex spectral density functions. Evaluating
(2.3.30) on the unit circle $z = e^{j\omega}$ gives

$$R_y(e^{j\omega}) = R_h(e^{j\omega}) R_x(e^{j\omega}) = |H(e^{j\omega})|^2 R_x(e^{j\omega}) \tag{2.3.32}$$

The output correlations $r_{xy}(l)$ and $r_y(l)$ for a periodic input with fundamental period $N$ are
computed via their spectral densities using the Fourier series. For example, it can be easily
shown that

$$R_k^{(y)} = |H(e^{j2\pi k/N})|^2 R_k^{(x)} \qquad 0 \le k \le N-1 \tag{2.3.33}$$

where $R_k^{(x)}$ and $R_k^{(y)}$ are the power spectral densities of $x(n)$ and $y(n)$, respectively.
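
The relation $r_y(l) = r_h(l) * r_x(l)$ for energy signals can be confirmed with short sequences. The Python sketch below is restricted to real-valued signals for simplicity and uses illustrative values for $h(n)$ and $x(n)$; it computes the output autocorrelation both directly from $y(n)$ and from the autocorrelations of $h(n)$ and $x(n)$.

```python
def conv(u, v):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            out[i + j] += ui * vj
    return out


def autocorr(x):
    """r(l), l = -(L-1), ..., L-1, of a real sequence: x(l) * x(-l)."""
    return conv(x, x[::-1])


h = [1.0, -0.5, 0.25]       # real FIR impulse response (illustrative)
x = [1.0, 2.0, 0.0, -1.0]   # real finite-energy input (illustrative)

y = conv(h, x)
r_y_direct = autocorr(y)                      # computed from the output
r_y_formula = conv(autocorr(h), autocorr(x))  # r_h(l) * r_x(l)
max_diff = max(abs(u - v) for u, v in zip(r_y_direct, r_y_formula))
```

Both lists cover the same lag range, centered on lag zero, and agree to rounding error.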

In exploring the properties of the various system models, we shall need to excite them 
by some input. Of particular interest are deterministic inputs that have constant power 
spectrum values (such as the unit sample sequence) or inputs that have constant power 
spectrum envelopes (such as all-pass signals). Since we have already discussed the unit 
sample response, we next focus on all-pass signals. 

All-pass signals have a flat spectrum, that is,

$$R_x(e^{j\omega}) = |X(e^{j\omega})|^2 = G^2 \qquad -\pi < \omega \le \pi \tag{2.3.34}$$

and, therefore, $r_x(l) = G^2 \delta(l)$. The simplest example is $x(n) = \delta(n-k)$. A more interesting
case is that of all-pass signals with nonlinear phase characteristic (see Section 2.4.2).
The autocorrelation and the spectral density of the output $y(n)$ of LTI systems to all-pass
excitations can be computed by the formulas used for unit impulse excitations, that is,

$$r_y(l) = G^2 r_h(l) = G^2 \sum_{n=-\infty}^{\infty} h(n) h^*(n-l) \tag{2.3.35}$$

and

$$R_y(z) = G^2 H(z) H^*\!\left(\frac{1}{z^*}\right) \tag{2.3.36}$$

By properly choosing $G$, we can always assume that $h(0) = 1$.


2.4 MINIMUM PHASE AND SYSTEM INVERTIBILITY 


In this section, we introduce the concept of minimum phase and show how it is related to the 
invertibility of linear, time-invariant systems. Several properties of all-pass and minimum- 
phase systems are also discussed. 


2.4.1 System Invertibility and Minimum-Phase Systems 


A system $H[\cdot]$ with input $x(n)$, $-\infty < n < \infty$, and output $y(n)$, $-\infty < n < \infty$, is
called invertible if we can uniquely determine its input signal from the output signal. This
is possible if the correspondence between the input and output signals is one-to-one. The
system that produces $x(n)$, when excited by $y(n)$, is denoted by $H_{\mathrm{inv}}$ and is called the inverse
of system $H$. Obviously, the cascade of $H$ and $H_{\mathrm{inv}}$ is the identity system. Obtaining the
inverse of an arbitrary system is a very difficult problem. However, if a system is linear
and time-invariant and its inverse exists, then the inverse is also linear and time-invariant.
Hence, if $h(n)$ is the impulse response of a linear, time-invariant system and $h_{\mathrm{inv}}(n)$ that of
its inverse, we have

$$[x(n) * h(n)] * h_{\mathrm{inv}}(n) = x(n)$$

or

$$h(n) * h_{\mathrm{inv}}(n) = \delta(n) \tag{2.4.1}$$

Thus, given $h(n)$, $-\infty < n < \infty$, we can obtain $h_{\mathrm{inv}}(n)$, $-\infty < n < \infty$, by solving the
convolution equation (2.4.1), which is not an easy task in general. However, (2.4.1) can be
converted to a simpler algebraic equation using the z-transform. Indeed, using the convolution
theorem, we obtain

$$H_{\mathrm{inv}}(z) = \frac{1}{H(z)} \tag{2.4.2}$$

where $H_{\mathrm{inv}}(z)$ is the system function of the inverse system. If $H(z)$ is a pole-zero system,
that is,

$$H(z) = \frac{D(z)}{A(z)} \tag{2.4.3}$$

then

$$H_{\mathrm{inv}}(z) = \frac{A(z)}{D(z)} \tag{2.4.4}$$


Thus, the zeros of the system become the poles of its inverse, and vice versa. Furthermore, 
the inverse of an all-pole system is all-zero, and vice versa. 
EXAMPLE 2.4.1. Consider a system with impulse response

$$h(n) = \delta(n) - \tfrac{1}{4}\delta(n-1)$$

Determine the impulse response of the inverse system.


Solution. The system function of its inverse is

$$H_{\mathrm{inv}}(z) = \frac{1}{1 - \frac{1}{4} z^{-1}}$$

which has a pole at $z = \frac{1}{4}$. If we choose the ROC as $|z| > \frac{1}{4}$, the inverse system is causal and
stable, and

$$h_{\mathrm{inv}}(n) = \left(\tfrac{1}{4}\right)^n u(n)$$

However, if we choose the ROC as $|z| < \frac{1}{4}$, the inverse system is noncausal and unstable, with

$$h_{\mathrm{inv}}(n) = -\left(\tfrac{1}{4}\right)^n u(-n-1)$$


This simple example illustrates that the knowledge of the impulse response of a linear, 
time-invariant system does not uniquely specify its inverse. Additional information such 
as causality and stability would be helpful in many cases. This leads us to the concept of 
minimum-phase systems. 
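
The causal choice in Example 2.4.1 can be checked against the convolution equation (2.4.1). The Python sketch below assumes the example's system is $h(n) = \delta(n) - \frac{1}{4}\delta(n-1)$ and truncates the infinite inverse response to $L = 20$ samples, a length chosen only for this illustration.

```python
def conv(u, v):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            out[i + j] += ui * vj
    return out


h = [1.0, -0.25]                        # delta(n) - (1/4) delta(n-1)
L = 20                                  # truncation length (assumption)
h_inv = [0.25 ** n for n in range(L)]   # causal inverse: (1/4)^n u(n)

c = conv(h, h_inv)
# c[0] is 1, c[1..L-1] vanish, and c[L] is the small truncation residue.
```

The only nonzero sample besides c[0] is the residue at $n = L$, of size $(1/4)^L$, which shrinks as the truncation length grows because the causal inverse is stable.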

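A discrete-time, linear, time-invariant system with impulse response $h(n)$ is called
minimum-phase if both the system and its inverse system $h_{\mathrm{inv}}(n)$ are causal and stable, that
is,

$$h(n) * h_{\mathrm{inv}}(n) = \delta(n) \tag{2.4.5}$$

$$h(n) = 0 \quad n < 0 \qquad\text{and}\qquad h_{\mathrm{inv}}(n) = 0 \quad n < 0 \tag{2.4.6}$$

$$\sum_{n=0}^{\infty} |h(n)| < \infty \qquad\text{and}\qquad \sum_{n=0}^{\infty} |h_{\mathrm{inv}}(n)| < \infty \tag{2.4.7}$$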


We note that if a system is minimum-phase, its inverse is also minimum-phase. This is very 
important in deconvolution problems, where the inverse system has to be causal and stable 
for implementation purposes. 

Sometimes, especially in geophysical applications, the stability requirements (2.4.7)
are replaced by the less restrictive† finite energy conditions

$$\sum_{n=0}^{\infty} |h(n)|^2 < \infty \qquad\text{and}\qquad \sum_{n=0}^{\infty} |h_{\mathrm{inv}}(n)|^2 < \infty \tag{2.4.8}$$

which are implied by (2.4.7). However, note that (2.4.8) does not necessarily imply (2.4.7).


†This definition of minimum phase allows singularities (poles or zeros) on the unit circle.

Clearly, a PZ system is minimum-phase if all its poles and zeros are inside the unit
circle. Indeed, if all roots of $A(z)$ and $D(z)$ are inside the unit circle, the system $H(z)$
in (2.4.3) and its inverse $H_{\mathrm{inv}}(z)$ in (2.4.4) are both causal and stable.

In an analogous manner, we can define a maximum-phase system as one in which both
the system and its inverse are anticausal and stable. A PZ system then is maximum-phase if
all its poles and zeros are outside the unit circle. Clearly, if $H(z)$ is minimum-phase, then
$H(z^{-1})$ is maximum-phase. A system that is neither minimum-phase nor maximum-phase
is called a mixed-phase system.
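
These definitions translate directly into a test on pole and zero magnitudes. The Python sketch below is a minimal illustration: the function name classify_phase and the tolerance tol are hypothetical, only the nontrivial poles and zeros should be passed (not those at $z = 0$), and roots on the unit circle are reported separately rather than classified.

```python
def classify_phase(zeros, poles=(), tol=1e-12):
    """Classify a PZ system from the magnitudes of its nontrivial roots."""
    roots = list(zeros) + list(poles)
    if any(abs(abs(r) - 1) <= tol for r in roots):
        return "singularity on the unit circle"
    if all(abs(r) < 1 for r in roots):
        return "minimum-phase"
    if all(abs(r) > 1 for r in roots):
        return "maximum-phase"
    return "mixed-phase"


# H(z) = (1 - 0.25 z^-1) / (1 - 0.5 z^-1): zero and pole inside
label_a = classify_phase([0.25], [0.5])
# Zero and pole both outside the unit circle
label_b = classify_phase([4.0], [2.0])
# Zero outside, pole inside
label_c = classify_phase([4.0], [0.5])
```

The function assumes the roots are already known; it does not compute them from polynomial coefficients.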


2.4.2 All-Pass Systems 


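We shall say that a linear, time-invariant system is all-pass, denoted by $H_{\mathrm{ap}}(e^{j\omega})$, if

$$|H_{\mathrm{ap}}(e^{j\omega})| = 1 \qquad -\pi < \omega \le \pi \tag{2.4.9}$$

The simplest all-pass system is characterized by

$$H_{\mathrm{ap}}(z) = z^{k}$$

which simply time-shifts (delay $k < 0$, advance $k > 0$) the input signal.
A more interesting, nontrivial family of all-pass systems is characterized by the system
function (dispersive all-pass systems)

$$H_{\mathrm{ap}}(z) = \frac{a_P^* + a_{P-1}^* z^{-1} + \cdots + z^{-P}}{1 + a_1 z^{-1} + \cdots + a_P z^{-P}} = \frac{z^{-P} A^*(1/z^*)}{A(z)} \tag{2.4.10}$$

Indeed, it can be easily seen that

$$|H_{\mathrm{ap}}(e^{j\omega})|^2 = H_{\mathrm{ap}}(z)\, H_{\mathrm{ap}}^*\!\left(\frac{1}{z^*}\right)\bigg|_{z=e^{j\omega}} = 1 \tag{2.4.11}$$

In the case of real-valued coefficients, (2.4.10) takes the form

$$H_{\mathrm{ap}}(z) = \frac{a_P + a_{P-1} z^{-1} + \cdots + z^{-P}}{1 + a_1 z^{-1} + \cdots + a_P z^{-P}} = \frac{z^{-P} A(z^{-1})}{A(z)} \tag{2.4.12}$$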
The poles and zeros of an all-pass system are conjugate reciprocals of one another; that
is, they are conjugate symmetric with respect to the unit circle. Indeed, if $p_0$ is a root
of $A(z)$, then $1/p_0^*$ is a root of $A^*(1/z^*)$. Thus, if $p_0 = re^{j\theta}$ is a pole of $H_{\mathrm{ap}}(z)$, then
$1/p_0^* = (1/r)e^{j\theta}$ is a zero of the system. This typical pattern is illustrated in Figure 2.10
for system functions with both complex and real coefficients.


FIGURE 2.10
Typical pole-zero patterns of a PZ, all-pass system: (a) complex-valued coefficients and
(b) real-valued coefficients.


Therefore, the system function of any pole-zero all-pass system can be expressed as

$$H_{\mathrm{ap}}(z) = \prod_{k=1}^{P} \frac{p_k^* - z^{-1}}{1 - p_k z^{-1}} \tag{2.4.13}$$

The similar expressions $(z^{-1} - p_k^*)/(1 - p_k z^{-1})$ and $(1 - p_k z^{-1})/(z^{-1} - p_k^*)$ [the negative
and inverse of (2.4.13), respectively] are often used in the literature. For systems with real
parameters, singularities should appear in complex conjugate pairs.


Properties of all-pass systems. All-pass systems have some interesting properties. We 
list these properties without proofs. Some of these proofs are trivial, and others are explored 
in problems. 


1. The output energy of a stable all-pass system is equal to the input energy; that is,

$$E_y = \sum_{n=-\infty}^{\infty} |y(n)|^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} |H_{\mathrm{ap}}(e^{j\omega}) X(e^{j\omega})|^2 \, d\omega = E_x \tag{2.4.14}$$

due to (2.4.9). This leads to a very interesting property for the cumulative energy of a
causal all-pass system (see Problem 2.6).

2. A causal, stable, PZ, all-pass system with $P$ poles has a phase response $\angle H_{\mathrm{ap}}(e^{j\omega})$ that
decreases monotonically from $\angle H_{\mathrm{ap}}(e^{j0})$ to $\angle H_{\mathrm{ap}}(e^{j0}) - 2\pi P$ as $\omega$ increases from 0
to $2\pi$ (see Problem 2.7).

3. All-pass systems have nonnegative group delay, which is defined as the negative of the
first derivative of the phase response, that is,

$$\tau_{\mathrm{ap}}(\omega) \triangleq -\frac{d}{d\omega} \angle H_{\mathrm{ap}}(e^{j\omega}) \ge 0 \tag{2.4.15}$$

This property is a direct result of the second property.

4. The all-pass system function

$$H_{\mathrm{ap}}(z) = \frac{1 - a z^{-1}}{z^{-1} - a^*} \qquad |a| < 1 \tag{2.4.16}$$

satisfies

$$|H_{\mathrm{ap}}(z)| \begin{cases} < 1 & \text{if } |z| < 1 \\ = 1 & \text{if } |z| = 1 \\ > 1 & \text{if } |z| > 1 \end{cases} \tag{2.4.17}$$

For proof see Problem 2.10.
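
Property 4 can be spot-checked numerically. The Python sketch below assumes that (2.4.16) reads $H_{\mathrm{ap}}(z) = (1 - az^{-1})/(z^{-1} - a^*)$ with $|a| < 1$; the value of $a$ and the test points are arbitrary choices for this illustration, and a few sample points do not replace the proof.

```python
import cmath


def h_ap(z, a):
    """First-order all-pass section (1 - a z^-1) / (z^-1 - a*)."""
    return (1 - a / z) / (1 / z - a.conjugate())


a = 0.5 + 0.3j   # any |a| < 1 (illustrative value)

# Magnitude on the unit circle at a few frequencies
on_circle = [abs(h_ap(cmath.exp(1j * w), a)) for w in (0.1, 1.0, 2.5, -2.0)]
# One test point inside and one outside the unit circle
inside = abs(h_ap(0.6 * cmath.exp(0.4j), a))
outside = abs(h_ap(2.0 * cmath.exp(-1.0j), a))
```

The magnitude equals one on the unit circle, is below one at the interior point, and exceeds one at the exterior point.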


2.4.3 Minimum-Phase and All-Pass Decomposition 


We next show that any causal, PZ system that has no poles or zeros on the unit circle can
be expressed as

$$H(z) = H_{\min}(z) H_{\mathrm{ap}}(z) \tag{2.4.18}$$

where $H_{\min}(z)$ is minimum-phase and $H_{\mathrm{ap}}(z)$ is all-pass, as shown in Figure 2.11. Indeed,
let $H(z)$ be a non-minimum-phase system with one zero $z = 1/a$, $|a| < 1$, outside the unit
circle and all other poles and zeros inside the unit circle. Then $H(z)$ can be factored as

$$H(z) = H_1(z)(a - z^{-1}) \tag{2.4.19}$$

where $H_1(z)$ is minimum-phase.


FIGURE 2.11
Minimum phase and all-pass decomposition.


Equivalently, (2.4.19) can be expressed as

$$\begin{aligned} H(z) &= H_1(z)(a - z^{-1})\,\frac{1 - a^* z^{-1}}{1 - a^* z^{-1}} \\ &= \left[H_1(z)(1 - a^* z^{-1})\right] \frac{a - z^{-1}}{1 - a^* z^{-1}} \\ &= H_{\min}(z)\,\frac{a - z^{-1}}{1 - a^* z^{-1}} \end{aligned} \tag{2.4.20}$$

where $H_{\min}(z)$ is minimum-phase and the factor $(a - z^{-1})/(1 - a^* z^{-1})$ is all-pass, because
$|a| < 1$. Note that the minimum-phase system was obtained from $H(z)$ by reflecting the
zero $z = 1/a$, which was outside the unit circle, to the zero $z = a^*$ inside the unit circle. This
approach can clearly be generalized for any PZ system. Thus, given a non-minimum-phase
PZ system, we can create a minimum-phase one with the same magnitude response (or
equivalently the same impulse response autocorrelation) by reflecting all poles and zeros
that are outside the unit circle to their conjugate reciprocal locations inside the unit circle.
From the previous discussion it follows that there are $2^Q$ $Q$th-order AZ systems with the same
magnitude response. This is illustrated in the following example.


EXAMPLE 2.4.2. For $Q = 2$, determine all four second-order AZ systems with the same
magnitude response.


Solution. For a second-order all-zero system ($0 < a < 1$, $0 < b < 1$) we obtain the following
systems

$$\begin{aligned} H_{\min}(z) &= (1 - az^{-1})(1 - bz^{-1}) & H_{\max}(z) &= (1 - az)(1 - bz) \\ H_{\mathrm{mix1}}(z) &= (1 - az)(1 - bz^{-1}) & H_{\mathrm{mix2}}(z) &= (1 - az^{-1})(1 - bz) \end{aligned} \tag{2.4.21}$$

that have the same spectrum

$$R(z) = H(z)H(z^{-1}) = (1 - az^{-1})(1 - bz^{-1})(1 - az)(1 - bz) \tag{2.4.22}$$

and the same autocorrelation

$$r(l) = \begin{cases} 1 + a^2b^2 + (a+b)^2 & l = 0 \\ -(a+b)(1 + ab) & l = 1, -1 \\ ab & l = 2, -2 \\ 0 & \text{otherwise} \end{cases} \tag{2.4.23}$$

but different impulse and phase responses, as shown in Figure 2.12.


EXAMPLE 2.4.3. Consider the following all-zero minimum-phase system:

$$\begin{aligned} H_{\min}(z) = {} & (1 - 0.8e^{j0.6\pi} z^{-1})(1 - 0.8e^{-j0.6\pi} z^{-1}) \\ & \times (1 - 0.8e^{j0.9\pi} z^{-1})(1 - 0.8e^{-j0.9\pi} z^{-1}) \end{aligned} \tag{2.4.24}$$

Determine the maximum- and mixed-phase systems with the same magnitude response.


FIGURE 2.12
Pole-zero, frequency response, and impulse response plots for minimum-phase (row 1),
maximum-phase (row 2), mixed-phase 1 (row 3), and mixed-phase 2 (row 4) systems in
Example 2.4.2. Note that the abscissas in the Phase plots are labeled in units of $\pi$ radians.


Solution. To obtain a maximum-phase system with the same magnitude response, we reflect the
zeros of $H_{\min}(z)$ from inside the unit circle to their conjugate reciprocal locations that are outside
the unit circle by using the transformation $z_0 \to 1/z_0^*$. This leads to the following transformation
for each first-order factor:

$$(1 - re^{j\theta} z^{-1}) \to r\left(1 - \frac{1}{r} e^{j\theta} z^{-1}\right) \tag{2.4.25}$$

The scaling factor $r$ in the right-hand side is included to guarantee that the transformation does
not scale the magnitude response. The resulting maximum-phase system is

$$\begin{aligned} H_{\max}(z) = {} & (0.8)^4 (1 - 1.25e^{j0.6\pi} z^{-1})(1 - 1.25e^{-j0.6\pi} z^{-1}) \\ & \times (1 - 1.25e^{j0.9\pi} z^{-1})(1 - 1.25e^{-j0.9\pi} z^{-1}) \end{aligned} \tag{2.4.26}$$

If we reflect only the zeros at $0.8e^{\pm j0.6\pi}$, we obtain the mixed-phase system

$$\begin{aligned} H_{\mathrm{mix1}}(z) = {} & (0.8)^2 (1 - 1.25e^{j0.6\pi} z^{-1})(1 - 1.25e^{-j0.6\pi} z^{-1}) \\ & \times (1 - 0.8e^{j0.9\pi} z^{-1})(1 - 0.8e^{-j0.9\pi} z^{-1}) \end{aligned} \tag{2.4.27}$$

Similarly, if we reflect only the zeros at $0.8e^{\pm j0.9\pi}$, we obtain the second mixed-phase system

$$\begin{aligned} H_{\mathrm{mix2}}(z) = {} & (0.8)^2 (1 - 0.8e^{j0.6\pi} z^{-1})(1 - 0.8e^{-j0.6\pi} z^{-1}) \\ & \times (1 - 1.25e^{j0.9\pi} z^{-1})(1 - 1.25e^{-j0.9\pi} z^{-1}) \end{aligned} \tag{2.4.28}$$

Figure 2.13 shows the pole-zero, magnitude response, phase response, and group delay plots
for all four systems. Clearly, the minimum-phase system has the smallest group delay, the
maximum-phase system has the largest group delay, while the mixed-phase systems have
in-between amounts of group delay across all frequencies. Finally, it can be easily shown that the
system $H_{\max}(z)/H_{\min}(z)$ is an all-pass system.


FIGURE 2.13
Pole-zero and frequency response plots for minimum-phase (row 1), maximum-phase (row 2),
mixed-phase 1 (row 3), and mixed-phase 2 (row 4) systems in Example 2.4.3. Note that the
abscissas in the Phase plots are labeled in units of $\pi$ radians while those in the Group delay plots
are labeled in sampling intervals.
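
The zero-reflection step of Example 2.4.3 can be sketched in a few lines of Python. The helper names below are hypothetical, and the zero locations are assumed to be $0.8e^{\pm j0.6\pi}$ and $0.8e^{\pm j0.9\pi}$ as in the example. The sketch reflects zeros to their conjugate reciprocals, applies the gain $r$ per reflected factor, and compares magnitude responses at a few frequencies.

```python
import cmath


def freq_resp(zeros, gain, w):
    """H(e^{jw}) for the all-zero system gain * prod_k (1 - z_k e^{-jw})."""
    val = gain
    for zk in zeros:
        val *= 1 - zk * cmath.exp(-1j * w)
    return val


def reflect(zeros):
    """Move each zero to 1/z_k* and return (new_zeros, gain), where
    gain = prod |z_k| keeps the magnitude response unchanged."""
    new_zeros = [1 / zk.conjugate() for zk in zeros]
    gain = 1.0
    for zk in zeros:
        gain *= abs(zk)
    return new_zeros, gain


pi = cmath.pi
pair_06 = [0.8 * cmath.exp(0.6j * pi), 0.8 * cmath.exp(-0.6j * pi)]
pair_09 = [0.8 * cmath.exp(0.9j * pi), 0.8 * cmath.exp(-0.9j * pi)]

zeros_min = pair_06 + pair_09
zeros_max, gain_max = reflect(zeros_min)       # all four zeros reflected
r06, g06 = reflect(pair_06)
zeros_mix1, gain_mix1 = r06 + pair_09, g06     # only the 0.6 pi pair reflected

freqs = [0.0, 0.5, 1.5, 2.5, pi]
mags = [(abs(freq_resp(zeros_min, 1.0, w)),
         abs(freq_resp(zeros_max, gain_max, w)),
         abs(freq_resp(zeros_mix1, gain_mix1, w))) for w in freqs]
```

At every test frequency the three magnitudes coincide to rounding error; only the phase responses, which are not computed here, distinguish the systems.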


The minimum- (maximum-) phase AZ system has all its zeros inside (outside) the unit
circle. From (2.4.12), it follows that an all-pass system can be expressed as

$$H_{\mathrm{ap}}(z) = \frac{H_{\max}(z)}{H_{\min}(z)} \tag{2.4.29}$$

where $H_{\min}(z)$ and $H_{\max}(z)$ are the $P$th-order minimum-phase and maximum-phase systems,
respectively, with the same magnitude response. Indeed, it can be easily seen that

$$H_{\max}(z) = z^{-P} H_{\min}^*\!\left(\frac{1}{z^*}\right) \tag{2.4.30}$$

or $h_{\max}(n) = h_{\min}^*(P - n)$.

In practice, it is very important to find out if a given system is minimum-phase. Clearly,
the definition cannot be used in practice because either the system $h(n)$ or its inverse is
going to be IIR. Furthermore, most of the above properties using either $h(n)$ or $H(e^{j\omega})$ are
not practical for use in real-world systems. However, if we deal with PZ systems, we can
check if they are minimum-phase by computing the poles and zeros and checking if they are
inside the unit circle. This is, however, a computationally expensive procedure, especially
for high-order systems. Fortunately, there are several tests that allow us to find out if the
zeros of a polynomial are inside the unit circle without computing them. See Theorem 2.3.


Properties of minimum-phase systems. Minimum-phase systems have some very in- 
teresting properties. Next we list some of these properties without proofs. More details can 
be found in Oppenheim and Schafer (1989) and Proakis and Manolakis (1996). 


1. For causal, stable systems with the same magnitude response, the minimum-phase system
has algebraically the smallest group delay response at every frequency, that is,
$\tau_{\min}(\omega) \le \tau(\omega)$ for all $\omega$. Thus, strictly speaking, minimum-phase systems are
minimum group delay systems. However, the term minimum-phase has been established
in the engineering literature.

2. Of all causal and stable systems with the same magnitude response, the minimum-phase
system minimizes the "energy delay"

$$\sum_{n=k}^{\infty} |h(n)|^2 \qquad \text{for all } k = 0, 1, \ldots, \infty \tag{2.4.31}$$

where $h(n)$ is the system impulse response.

3. The system $H(z)$ is minimum-phase if $\log|H(e^{j\omega})|$ and $\angle H(e^{j\omega})$ form a Hilbert
transform pair.


EXAMPLE 2.4.4. In this example we illustrate the energy delay property of minimum-phase
systems. Consider the all-zero minimum-phase system (2.4.24) given in Example 2.4.3 and
repeated here:

$$\begin{aligned} H_{\min}(z) = {} & (1 - 0.8e^{j0.6\pi} z^{-1})(1 - 0.8e^{-j0.6\pi} z^{-1}) \\ & \times (1 - 0.8e^{j0.9\pi} z^{-1})(1 - 0.8e^{-j0.9\pi} z^{-1}) \end{aligned}$$

In the top row of four plots in Figure 2.14, we depict the impulse responses of the minimum-,
maximum-, and mixed-phase systems. The bottom plot contains the graph of the energy delay
$\sum_{n=k}^{\infty} |h(n)|^2$ for $k = 0, 1, \ldots, 4$, for each of the systems. As expected, the minimum-phase
system has the least amount of energy delay while the maximum-phase system has the greatest
amount of energy delay at each $k$. The graphs of the energy delays for mixed-phase systems are
somewhere in between the above two graphs.
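
The energy delay comparison can be reproduced numerically. The Python sketch below assumes the zero locations of Example 2.4.3 ($0.8e^{\pm j0.6\pi}$ and $0.8e^{\pm j0.9\pi}$), builds the impulse responses by multiplying out the first-order factors, and computes the tail energies $\sum_{n=k}^{\infty} |h(n)|^2$ for $k = 0, \ldots, 4$. The helper names are hypothetical.

```python
import cmath


def poly_from_zeros(zeros, gain=1.0):
    """Coefficients of gain * prod_k (1 - z_k z^-1), ascending powers of z^-1."""
    coeffs = [complex(gain)]
    for zk in zeros:
        nxt = coeffs + [0j]
        for i, c in enumerate(coeffs):
            nxt[i + 1] -= zk * c
        coeffs = nxt
    return coeffs


def energy_delay(h):
    """Tail energies sum_{n=k}^{inf} |h(n)|^2 for k = 0, ..., len(h) - 1."""
    tails, acc = [], 0.0
    for v in reversed(h):
        acc += abs(v) ** 2
        tails.append(acc)
    return tails[::-1]


pi = cmath.pi
pair_06 = [0.8 * cmath.exp(0.6j * pi), 0.8 * cmath.exp(-0.6j * pi)]
pair_09 = [0.8 * cmath.exp(0.9j * pi), 0.8 * cmath.exp(-0.9j * pi)]
refl_06 = [1 / z.conjugate() for z in pair_06]
refl_09 = [1 / z.conjugate() for z in pair_09]

h_min = poly_from_zeros(pair_06 + pair_09)
h_mix1 = poly_from_zeros(refl_06 + pair_09, gain=0.8 ** 2)
h_max = poly_from_zeros(refl_06 + refl_09, gain=0.8 ** 4)

e_min, e_mix1, e_max = (energy_delay(h) for h in (h_min, h_mix1, h_max))
```

All three systems have the same total energy (the $k = 0$ entries), while for $k \ge 1$ the tail energy of the minimum-phase system is the smallest and that of the maximum-phase system the largest.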


Additional properties of minimum-phase systems are explored in the problems. 


2.4.4 Spectral Factorization 


One interesting and practically useful question is the following: Can we completely
determine the system $H(z)$, excited by an input with the flat spectrum $|X(e^{j\omega})|^2 = G^2$ of
(2.3.34), given $r_y(l)$ or, equivalently, the spectral density $R_y(e^{j\omega})$? The answer is that the
system is not unique, since all we know either from $r_y(l)$ or from $R_y(e^{j\omega})$ is the magnitude
response $|H(e^{j\omega})|$, but not the phase response $\angle H(e^{j\omega})$. To obtain a unique system from
(2.3.35) or (2.3.36), we have to impose additional conditions on $H(z)$. One such condition is
that of a minimum-phase system. The process of obtaining the minimum-phase system that
produces the signal $y(n)$ with autocorrelation $r_y(l)$ or spectral density $R_y(z)$ is called spectral
factorization. Equivalently, the spectral factorization problem can be stated as the determination
of a minimum-phase system from its magnitude response or from the autocorrelation of its
impulse response.


FIGURE 2.14
Impulse response plots of the four systems in the top row and the energy delay plots in the
bottom row in Example 2.4.4.


Solving the spectral factorization problem by finding roots of $R_y(z)$ is known as the
root method, and besides its practical utility, it illustrates some basic principles.

1. Every rational power spectral density has, within a scale factor, a unique minimum-phase
factorization.

2. There are $2^{P+Q}$ rational systems with the same power spectral density, where $Q$ and $P$
are numerator and denominator polynomial degrees, respectively.

3. Not all possible rational functions are valid power spectral densities, since for a valid
$R_y(z)$ the roots should appear in pairs, $z_k$ and $1/z_k^*$.

These principles can be generalized to any power spectral density by extending
$P + Q \to \infty$. The spectral factorization procedure is guaranteed by the following theorem.


THEOREM 2.1. If $\ln R_y(z)$ is analytic in an open ring $\alpha < |z| < 1/\alpha$ in the z-plane and the ring
includes the unit circle, then $R_y(z)$ can be factored as

$$R_y(z) = G^2 H_{\min}(z) H_{\min}^*\!\left(\frac{1}{z^*}\right) \tag{2.4.32}$$

where $H_{\min}(z)$ is a minimum-phase system.


Proof. Using the analyticity of $\ln R_y(z)$, we can expand $\ln R_y(z)$ in a Laurent series (Churchill
and Brown 1984) as

$$\ln R_y(z) = \sum_{l=-\infty}^{\infty} g(l) z^{-l} \tag{2.4.33}$$

where the sequence $g(l)$ is known as the cepstrum of the sequence $r_y(l)$ (Oppenheim and Schafer
1989). Evaluating (2.4.33) on the unit circle, we obtain

$$\ln R_y(e^{j\omega}) = \sum_{l=-\infty}^{\infty} g(l) e^{-j\omega l} \tag{2.4.34}$$

or

$$g(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln R_y(e^{j\omega})\, e^{j\omega l}\, d\omega \tag{2.4.35}$$

Since $R_y(e^{j\omega}) = |Y(e^{j\omega})|^2$ is a real, nonnegative function, the sequence $g(l)$ is a conjugate
symmetric sequence, that is,

$$g(l) = g^*(-l) \tag{2.4.36}$$

and

$$G^2 \triangleq \exp g(0) = \exp\left[\frac{1}{2\pi} \int_{-\pi}^{\pi} \ln R_y(e^{j\omega})\, d\omega\right] > 0 \tag{2.4.37}$$

From (2.4.33), we can express $R_y(z)$ in a factored form as

$$\begin{aligned} R_y(z) &= \exp\left[\sum_{l=-\infty}^{\infty} g(l) z^{-l}\right] = \exp\left[\sum_{l=-\infty}^{-1} g(l) z^{-l} + g(0) + \sum_{l=1}^{\infty} g(l) z^{-l}\right] \\ &= \exp g(0)\, \exp\left[\sum_{l=1}^{\infty} g(l) z^{-l}\right] \exp\left[\sum_{l=-\infty}^{-1} g(l) z^{-l}\right] \\ &= G^2 \exp\left[\sum_{l=1}^{\infty} g(l) z^{-l}\right] \exp\left[\sum_{l=1}^{\infty} g^*(l) z^{l}\right] \end{aligned} \tag{2.4.38}$$

where we used (2.4.36). After defining

$$H(z) \triangleq \exp\left[\sum_{l=1}^{\infty} g(l) z^{-l}\right] \qquad |z| > \alpha \tag{2.4.39}$$

so that

$$H^*\!\left(\frac{1}{z^*}\right) = \exp\left[\sum_{l=1}^{\infty} g^*(l) z^{l}\right] \qquad |z| < \frac{1}{\alpha} \tag{2.4.40}$$

we obtain the spectral factorization (2.3.36). Furthermore, from (2.4.37) we note that the constant
$G^2$ is equal to the geometric mean of $R_y(e^{j\omega})$. From (2.4.39), note that $H(z)$ is the z-transform
of a causal and stable sequence, hence it can be expanded as

$$H(z) = 1 + h(1) z^{-1} + h(2) z^{-2} + \cdots \tag{2.4.41}$$

where $h(0) = \lim_{z \to \infty} H(z) = 1$. Also from (2.4.39) $H(z)$ corresponds to a minimum-phase
system so that from (2.4.40) $H^*(1/z^*)$ is a stable, anticausal, and maximum-phase system.


The analyticity of $\ln R_y(z)$ is guaranteed by the Paley-Wiener theorem given below
without proof (see Papoulis 1991).


THEOREM 2.2 (PALEY-WIENER THEOREM). The spectral factorization in (2.4.32) is possible
if $R_y(z)$ satisfies the Paley-Wiener condition

$$\int_{-\pi}^{\pi} |\ln R_y(e^{j\omega})|\, d\omega < \infty$$

If $H(z)$ is known to be minimum-phase, the spectral factorization is unique.


In general, the solution of the spectral factorization problem is difficult. However, it
is quite simple in the case of signals with rational spectral densities. Suppose that $R_y(z)$ is
a rational complex spectral density function. Since $r_y(l) = r_y^*(-l)$ implies that $R_y(z) =
R_y^*(1/z^*)$, if $z_i$ is a root, then $1/z_i^*$ is also a root. If $z_i$ is inside the unit circle, then $1/z_i^*$ is
outside. To obtain the minimum-phase system $H(z)$ corresponding to $R_y(z)$, we determine
the poles and zeros of $R_y(z)$ and form $H(z)$ by choosing all poles and zeros that are inside
the unit circle, that is,

$$H(z) = G\,\frac{\prod_{k=1}^{Q} (1 - z_k z^{-1})}{\prod_{k=1}^{P} (1 - p_k z^{-1})} \tag{2.4.42}$$

where $|z_k| < 1$, $k = 1, 2, \ldots, Q$, and $|p_k| < 1$, $k = 1, 2, \ldots, P$.

Before we illustrate this by an example, it should be emphasized that for real-valued
coefficients $R_y(e^{j\omega})$ is a rational function of $\cos\omega$. Indeed, we have from (2.3.36) and (2.3.13)

$$R_y(z) = G^2 H(z) H^*\!\left(\frac{1}{z^*}\right) = G^2\, \frac{D(z) D^*(1/z^*)}{A(z) A^*(1/z^*)} \tag{2.4.43}$$

where

$$D(z) = \sum_{k=0}^{Q} d_k z^{-k} \qquad\text{and}\qquad A(z) = 1 + \sum_{k=1}^{P} a_k z^{-k} \tag{2.4.44}$$

Clearly, (2.4.43) can be written as

$$R_y(e^{j\omega}) = G^2\, \frac{R_d(e^{j\omega})}{R_a(e^{j\omega})} = G^2\, \frac{r_d(0) + 2\sum_{l=1}^{Q} r_d(l) \cos l\omega}{r_a(0) + 2\sum_{l=1}^{P} r_a(l) \cos l\omega} \tag{2.4.45}$$

where $r_d(l) = r_d(-l)$ and $r_a(l) = r_a(-l)$ are the autocorrelations of the coefficient
sequences $\{d_0, d_1, \ldots, d_Q\}$ and $\{1, a_1, \ldots, a_P\}$, respectively. Since $\cos l\omega$ can be expressed
as a polynomial

$$\cos l\omega = \sum_{i=0}^{l} \alpha_i (\cos\omega)^i$$

it follows that $R_y(e^{j\omega})$ is a rational function of $\cos\omega$.
EXAMPLE 2.4.5. Let

$$R_y(e^{j\omega}) = \frac{1.04 + 0.4\cos\omega}{1.25 + \cos\omega}$$

Determine the minimum-phase system corresponding to $R_y(e^{j\omega})$.


Solution. Replacing $\cos\omega$ by $(e^{j\omega} + e^{-j\omega})/2$ or directly by $(z + z^{-1})/2$ gives

$$R_y(z) = \frac{1.04 + 0.2z + 0.2z^{-1}}{1.25 + 0.5z + 0.5z^{-1}} = 0.4\, \frac{(z + 5)(z + 0.2)}{(z + 2)(z + 0.5)}$$

The required minimum-phase system $H(z)$ is

$$H(z) = \frac{z + 0.2}{z + 0.5} = \frac{1 + 0.2z^{-1}}{1 + 0.5z^{-1}} \tag{2.4.46}$$
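
The factorization in Example 2.4.5 can be confirmed by evaluating both sides on the unit circle. The Python sketch below assumes the result $H(z) = (1 + 0.2z^{-1})/(1 + 0.5z^{-1})$ and uses a handful of arbitrary test frequencies; for this particular example the squared magnitude of $H$ matches $R_y(e^{j\omega})$ without any additional gain, that is, $G = 1$.

```python
import cmath
import math


def R_y(w):
    """Given spectral density (1.04 + 0.4 cos w) / (1.25 + cos w)."""
    return (1.04 + 0.4 * math.cos(w)) / (1.25 + math.cos(w))


def H_min(z):
    """Candidate minimum-phase factor (1 + 0.2 z^-1) / (1 + 0.5 z^-1)."""
    return (1 + 0.2 / z) / (1 + 0.5 / z)


freqs = [0.0, 0.3, 1.0, 2.0, math.pi]
errors = [abs(abs(H_min(cmath.exp(1j * w))) ** 2 - R_y(w)) for w in freqs]
max_err = max(errors)

# The zero at z = -0.2 and the pole at z = -0.5 are both inside the unit circle
zero, pole = -0.2, -0.5
```

The discarded roots at $z = -5$ and $z = -2$ are the reciprocals of the retained zero and pole, as expected for a valid spectral density.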


2.5 LATTICE FILTER REALIZATIONS 


In Section 2.3, we described simple FIR and IIR filter realizations using block diagram
elements. These realizations are called filter structures, of which many different
types are available for implementation (Proakis and Manolakis 1996). In this section, we discuss
the lattice and lattice-ladder filters. The lattice filter is an implementation of a digital filter
with a rational system function. This structure is used extensively in digital speech processing
and in the implementation of adaptive filters, which are discussed in Chapter 10.


2.5.1 All-Zero Lattice Structures 


In Section 2.3, we discussed a direct-form realization of an AZ filter (see Figure 2.6). In 
this section, we present lattice structures for the realization of AZ filters. These structures 
will be used extensively throughout this book. 

The basic AZ lattice is shown in Figure 2.15. Because the AZ lattice is often used to
implement the inverse of an AP filter, we begin our introduction to the lattice by a realization
of the AZ filter

$$A(z) = 1 + \sum_{l=1}^{P} a_l z^{-l} \tag{2.5.1}$$

The lattice in Figure 2.15(a) is the two-multiplier, or Itakura-Saito, lattice. The lattice has
$P$ parameters $\{k_m, 1 \le m \le P\}$ that map to the $a_l$ direct-form parameters via a recursive
relation that is derived below.

At the $m$th stage of the lattice, shown in Figure 2.15(a), we have the relations

$$f_m(n) = f_{m-1}(n) + k_m g_{m-1}(n-1) \qquad 1 \le m \le P \tag{2.5.2}$$

$$g_m(n) = k_m^* f_{m-1}(n) + g_{m-1}(n-1) \qquad 1 \le m \le P \tag{2.5.3}$$

and from Figure 2.15(b), we have

$$f_0(n) = g_0(n) = x(n) \tag{2.5.4}$$

$$y(n) = f_P(n) \tag{2.5.5}$$

Taking the z-transform of $f_m(n)$ and $g_m(n)$, we have

$$F_m(z) = F_{m-1}(z) + k_m z^{-1} G_{m-1}(z) \tag{2.5.6}$$

$$G_m(z) = k_m^* F_{m-1}(z) + z^{-1} G_{m-1}(z) \tag{2.5.7}$$

Dividing both equations by $X(z)$ and denoting the transfer functions from the input $x(n)$
to the outputs of the $m$th stage by $A_m(z)$ and $B_m(z)$, where

$$A_m(z) \triangleq \frac{F_m(z)}{F_0(z)} \qquad B_m(z) \triangleq \frac{G_m(z)}{G_0(z)} \tag{2.5.8}$$

we have

$$A_m(z) = A_{m-1}(z) + k_m z^{-1} B_{m-1}(z) \tag{2.5.9}$$

$$B_m(z) = k_m^* A_{m-1}(z) + z^{-1} B_{m-1}(z) \tag{2.5.10}$$

with

$$A_0(z) = B_0(z) = 1 \tag{2.5.11}$$

and

$$A(z) = A_P(z) \tag{2.5.12}$$


FIGURE 2.15
All-zero lattice structure.

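Thus, the desired $A(z)$ is obtained as the transfer function $A_P(z)$ at the $P$th stage of the
lattice. Now (2.5.9) and (2.5.10) can be written in matrix form as

$$\begin{bmatrix} A_m(z) \\ B_m(z) \end{bmatrix} = \begin{bmatrix} 1 & k_m z^{-1} \\ k_m^* & z^{-1} \end{bmatrix} \begin{bmatrix} A_{m-1}(z) \\ B_{m-1}(z) \end{bmatrix} \tag{2.5.13}$$

$$\phantom{\begin{bmatrix} A_m(z) \\ B_m(z) \end{bmatrix}} = \mathbf{Q}_m(z) \begin{bmatrix} A_{m-1}(z) \\ B_{m-1}(z) \end{bmatrix} \tag{2.5.14}$$

where

$$\mathbf{Q}_m(z) \triangleq \begin{bmatrix} 1 & k_m z^{-1} \\ k_m^* & z^{-1} \end{bmatrix} \tag{2.5.15}$$

Then, using the recursive relation (2.5.13), we obtain

$$\begin{bmatrix} A_P(z) \\ B_P(z) \end{bmatrix} = \mathbf{Q}_P(z)\, \mathbf{Q}_{P-1}(z) \cdots \mathbf{Q}_1(z) \begin{bmatrix} 1 \\ 1 \end{bmatrix} \tag{2.5.16}$$

If we write $A_m(z)$ as

$$A_m(z) = \sum_{l=0}^{m} a_l^{(m)} z^{-l} \tag{2.5.17}$$

then we can show that

$$a_0^{(m)} = 1 \qquad \text{for all } m \tag{2.5.18}$$

and that

$$B_m(z) = \sum_{l=0}^{m} a_{m-l}^{(m)*} z^{-l} = z^{-m} A_m^*\!\left(\frac{1}{z^*}\right) \tag{2.5.19}$$

that is, for $m = 1, 2, \ldots, P$

$$b_l^{(m)} = \begin{cases} a_{m-l}^{(m)*} & l = 0, 1, \ldots, m-1 \\ 1 & l = m \end{cases} \tag{2.5.20}$$

The polynomial $B_m(z)$ is known as the conjugate reverse polynomial of $A_m(z)$ because its
coefficients are the conjugates of those of $A_m(z)$ except that they are in reverse order. So
since

$$A_m(z) = 1 + a_1^{(m)} z^{-1} + a_2^{(m)} z^{-2} + \cdots + a_m^{(m)} z^{-m} \tag{2.5.21}$$

then

$$B_m(z) = a_m^{(m)*} + a_{m-1}^{(m)*} z^{-1} + \cdots + a_1^{(m)*} z^{-(m-1)} + z^{-m} \tag{2.5.22}$$

If $z_0$ is a zero of $A_m(z)$, then $1/z_0^*$ is a zero of $B_m(z)$. Therefore, if $A_m(z)$ is minimum-phase,
then $B_m(z)$ is maximum-phase.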
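Equations (2.5.19), (2.5.9), and (2.5.10) can be combined into a single equation
$$A_m(z) = A_{m-1}(z) + k_m z^{-m} A_{m-1}^*\!\left(\frac{1}{z^*}\right) \tag{2.5.23}$$
This equation can be used to derive the following relation between the coefficients at stage
m in terms of the coefficients at stage m - 1:
$$a_l^{(m)} = a_l^{(m-1)} + k_m\, a_{m-l}^{(m-1)*} \qquad l = 1, 2, \ldots, m-1 \tag{2.5.24}$$
together with the end conditions $a_0^{(m)} = 1$ and $a_m^{(m)} = k_m$, which follow from the
coefficients of $z^0$ and $z^{-m}$ in (2.5.23).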


To solve for the coefficients of the transfer function of the complete P-stage lattice,
compute (2.5.24) recursively, starting with m = 1 until m = P. The final coefficients $a_l$ of
the desired filter $A(z)$ are then given by
$$a_l = a_l^{(P)} \qquad 0 \le l \le P \tag{2.5.25}$$
By substituting $m - l$ for $l$ in (2.5.24), we have
$$a_{m-l}^{(m)} = a_{m-l}^{(m-1)} + k_m\, a_l^{(m-1)*} \tag{2.5.26}$$
Therefore, $a_l^{(m)}$ and $a_{m-l}^{(m)}$ can be computed simultaneously using $a_l^{(m-1)}$,
$a_{m-l}^{(m-1)}$, and $k_m$.
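The forward recursion (2.5.24) can be sketched in a few lines of MATLAB. This is a minimal
illustration only, not the book's latcf2df function; the variable names and the example
values k = [0.5 0.2] are assumptions made for the sketch. Each pass forms the coefficients
of $A_m(z)$ from those of $A_{m-1}(z)$ by using (2.5.9), with $B_{m-1}(z)$ taken as the
conjugate reverse of $A_{m-1}(z)$.

```matlab
% Step-up recursion: lattice parameters k_1..k_P -> direct-form a_0..a_P.
% Illustrative sketch; variable names and example values are assumptions.
k = [0.5 0.2];                 % example lattice parameters
P = numel(k);
a = 1;                         % A_0(z) = 1, see (2.5.11)
for m = 1:P
    b = conj(fliplr(a));       % coefficients of B_{m-1}(z), see (2.5.19)
    a = [a 0] + k(m)*[0 b];    % A_m(z) = A_{m-1}(z) + k_m z^{-1} B_{m-1}(z)
end
disp(a)                        % coefficients of A(z) = A_P(z)
```

For these example values the recursion gives a = [1 0.6 0.2], that is,
$A(z) = 1 + 0.6z^{-1} + 0.2z^{-2}$.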


The lattice parameters $k_m$ can be recovered from the coefficients $a_l$ by a backward
recursion. Eliminating $z^{-1}B_{m-1}(z)$ from (2.5.9) and (2.5.10) and using (2.5.19), we obtain
$$A_{m-1}(z) = \frac{A_m(z) - k_m z^{-m} A_m^*(1/z^*)}{1 - |k_m|^2} \tag{2.5.27}$$
The recursion can be started by setting $a_l^{(P)} = a_l$, $0 \le l \le P$. Then, with
m = P, P - 1, ..., 1, we compute from (2.5.27)
$$k_m = a_m^{(m)}$$
$$a_l^{(m-1)} = \begin{cases} 1 & l = 0 \\ \dfrac{a_l^{(m)} - k_m\, a_{m-l}^{(m)*}}{1 - |k_m|^2} & 1 \le l \le m-1 \end{cases} \tag{2.5.28}$$
This is the backward recursion to compute $k_m$ from $a_l$. The computation in (2.5.28) is always
possible except when some $|k_m| = 1$. Except for this indeterminate case, the mapping
between the lattice parameters $k_m$ and the coefficients $a_l$ of the corresponding all-zero filter
is unique.
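The backward recursion (2.5.28) can be sketched as follows. This is an illustrative
fragment under stated assumptions, not the book's df2latcf function: the variable names and
the example polynomial a = [1 0.6 0.2] are chosen for the sketch, and the indeterminate case
$|k_m| = 1$ is simply reported as an error.

```matlab
% Step-down recursion (2.5.28): direct-form a_0..a_P -> lattice k_1..k_P.
% Illustrative sketch; variable names and example values are assumptions.
a = [1 0.6 0.2];               % example A(z), with a_0 = 1
P = numel(a) - 1;
k = zeros(1, P);
c = a;                         % coefficients of A_m(z), starting with m = P
for m = P:-1:1
    k(m) = c(end);             % k_m = a_m^{(m)}
    if abs(abs(k(m)) - 1) < eps
        error('Indeterminate case: |k_m| = 1 at stage %d.', m);
    end
    b = conj(fliplr(c));       % coefficients of B_m(z)
    c = (c - k(m)*b) / (1 - abs(k(m))^2);   % A_{m-1}(z), see (2.5.27)
    c = c(1:end-1);            % the z^{-m} coefficient is now zero
end
disp(k)
```

For this example the recursion returns k = [0.5 0.2], which the forward recursion (2.5.24)
maps back to the same a.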

The MATLAB function [k] = df2latcf(a) computes lattice coefficients $k_m$ from polynomial
coefficients $a_l$ using (2.5.28). Similarly, the function [a] = latcf2df(k) computes
the direct-form coefficients from the lattice form.

Although the AZ lattice filters are highly modular, their software implementation is
more complex than the direct-form structures. To understand this implementation, we will
consider the steps involved in determining one output sample in a P-stage AZ lattice.
Assume that $x(n)$ is available over $1 \le n \le N$.


Input stage: The describing equation is
$$f_0(n) = g_0(n) = x(n) \qquad 1 \le n \le N$$
Thus in the implementation, $f_0(n)$ and $g_0(n)$ can be replaced by the input sample $x(n)$,
which is assumed to be available in array x.

Stage 1: The describing equations are
$$f_1(n) = f_0(n) + k_1 g_0(n-1) = x(n) + k_1 x(n-1)$$
$$g_1(n) = k_1^* f_0(n) + g_0(n-1) = k_1^* x(n) + x(n-1)$$
Assuming that we have two arrays f and g of length P available to store $f_m(n)$ and $g_m(n)$
at each n, respectively, and two arrays k and ck of length P to store $k_m$ and $k_m^*$, respectively,
then the MATLAB fragment is

    f(1) = x(n) + k(1)*x(n-1);
    g(1) = ck(1)*x(n) + x(n-1);

At n = 1, we need x(0) in the above equations. This is an initial condition and is assumed
to be zero. Hence in the implementation, we need to augment the x array by prepending it
with a zero. This should be done in the initialization part. Similarly, arrays f and g should
be initialized to zero.

Stages 2 through P: The describing equations are
$$f_m(n) = f_{m-1}(n) + k_m\, g_{m-1}(n-1)$$
$$g_m(n) = k_m^*\, f_{m-1}(n) + g_{m-1}(n-1)$$
Note that we need old (i.e., at n - 1) values of array g in $g_{m-1}(n-1)$. Although it is possible
to avoid an additional array, for programming simplicity, we will assume that $g_m(n-1)$ is
available in an array g_old of length P. This array should also be initialized to zero. The
MATLAB fragment, executed for m = 2, ..., P, is

    f(m) = f(m-1) + k(m)*g_old(m-1);
    g(m) = ck(m)*f(m-1) + g_old(m-1);


Output stage: The describing equation is
$$y(n) = f_P(n)$$
Also we need to store the current $g_m(n)$ values in the g_old array for use in the calculations
of the next output value. Thus the MATLAB fragment is

    g_old = g;
    y = f(P);

Now we can go back to stage 1 with new input value and recursively compute the remaining
output values.
The complete procedure is implemented in the function y = latcfilt(k,x).
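The input, intermediate, and output stages described above can be assembled into one loop.
The following is a minimal sketch of that idea and is not the book's latcfilt function; the
example values of k and x, the array name xa for the zero-prepended input, and the final
comparison against the direct-form coefficients [1 0.6 0.2] (obtained from k = [0.5 0.2] by
the forward recursion) are assumptions made for illustration.

```matlab
% All-zero lattice filtering, one output sample per pass of the outer loop.
% Illustrative sketch; example data and variable names are assumptions.
k  = [0.5 0.2];                % example lattice parameters k_1..k_P
x  = [1 2 3 4];                % example input x(1)..x(N)
P  = numel(k);  N = numel(x);
ck = conj(k);                  % conjugated lattice parameters
xa = [0 x];                    % prepend the initial condition x(0) = 0
f = zeros(1, P);  g = zeros(1, P);  g_old = zeros(1, P);
y = zeros(1, N);
for n = 1:N
    % Stage 1: xa(n+1) is x(n) and xa(n) is x(n-1)
    f(1) = xa(n+1) + k(1)*xa(n);
    g(1) = ck(1)*xa(n+1) + xa(n);
    % Stages 2 through P
    for m = 2:P
        f(m) = f(m-1) + k(m)*g_old(m-1);
        g(m) = ck(m)*f(m-1) + g_old(m-1);
    end
    % Output stage: save g for the next sample and read out f_P(n)
    g_old = g;
    y(n)  = f(P);
end
y_direct = filter([1 0.6 0.2], 1, x);   % direct-form reference
disp([y; y_direct])
```

Both rows of the displayed matrix should agree, since the lattice and the direct form
realize the same $A(z)$.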


2.5.2 All-Pole Lattice Structures 


The AZ lattice in Figure 2.15 can be restructured quite simply to yield a corresponding
all-pole (AP) lattice structure. Let an AP system function be given by
$$H(z) = \frac{1}{1 + \sum_{l=1}^{P} a_l z^{-l}} = \frac{1}{A(z)} \tag{2.5.29}$$
which clearly is the inverse system of the AZ lattice of Figure 2.15. The difference equation
corresponding to (2.5.29) is
$$y(n) + \sum_{l=1}^{P} a_l\, y(n-l) = x(n) \tag{2.5.30}$$
If we interchange $x(n)$ with $y(n)$ in (2.5.30), we will obtain the AZ system of (2.5.1).
Therefore, the lattice structure of the AP system can be obtained from Figure 2.15(b) by
interchanging $x(n)$ with $y(n)$. This lattice structure with P stages is shown in Figure 2.16(b).
To determine the mth stage of the AP lattice, we consider (2.5.4) and (2.5.5) and interchange
$x(n)$ with $y(n)$. Thus the lattice structure shown in Figure 2.16(b) has
$$f_P(n) = x(n) \tag{2.5.31}$$
as the input and
$$f_0(n) = g_0(n) = y(n) \tag{2.5.32}$$
as the output. The signal quantities $\{f_m(n)\}_{m=0}^{P}$ then must be computed in descending
order, which can be obtained by rearranging (2.5.2) but not (2.5.3). Thus we obtain
$$f_{m-1}(n) = f_m(n) - k_m\, g_{m-1}(n-1) \tag{2.5.33}$$
and
$$g_m(n) = k_m^*\, f_{m-1}(n) + g_{m-1}(n-1) \tag{2.5.34}$$
These two equations represent the mth stage of the all-pole lattice, shown in Figure 2.16(a),
where $f_m(n)$ and $g_{m-1}(n)$ are now the inputs to the mth stage and $f_{m-1}(n)$ and $g_m(n)$ are
the outputs. The transfer function from the input to the output is the same as that from $f_P(n)$
to $f_0(n)$. This transfer function is the inverse of the transfer function from $f_0(n)$ to $f_P(n)$.
From (2.5.8), we conclude that the transfer function from $x(n)$ to $y(n)$ in Figure 2.16 is
equal to
$$\frac{Y(z)}{X(z)} = \frac{F_0(z)}{F_P(z)} = \frac{1}{A_P(z)} \tag{2.5.35}$$
where $A_P(z) = A(z)$ in (2.5.29). To multiply (2.5.35) by the gain G, we simply multiply
either $x(n)$ or $y(n)$ by G in Figure 2.16(b).

FIGURE 2.16
All-pole lattice structure.
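The stage equations (2.5.33) and (2.5.34) can be turned into a sample-by-sample loop. The
sketch below is one possible arrangement under stated assumptions, not a routine from the
book: the example values of k and x, the array name gd (holding the delayed values
$g_{m-1}(n-1)$), and the direct-form reference with denominator [1 0.6 0.2] are all chosen
for illustration.

```matlab
% All-pole lattice filtering using (2.5.31)-(2.5.34).
% Illustrative sketch; example data and variable names are assumptions.
k = [0.5 0.2];                 % example lattice parameters k_1..k_P
x = [1 0 0 0 0];               % example input: unit impulse
P = numel(k);  N = numel(x);
gd = zeros(1, P);              % gd(m) holds g_{m-1}(n-1), initially zero
y  = zeros(1, N);
for n = 1:N
    f = x(n);                          % f_P(n) = x(n), see (2.5.31)
    g = zeros(1, P+1);                 % g(m+1) will hold g_m(n)
    for m = P:-1:1
        f = f - k(m)*gd(m);            % f_{m-1}(n), see (2.5.33)
        g(m+1) = conj(k(m))*f + gd(m); % g_m(n), see (2.5.34)
    end
    g(1) = f;                          % g_0(n) = f_0(n) = y(n), see (2.5.32)
    y(n) = f;
    gd   = g(1:P);                     % delayed values for the next sample
end
y_direct = filter(1, [1 0.6 0.2], x);  % direct-form reference 1/A(z)
disp([y; y_direct])
```

The two displayed rows should match, since the lattice with k = [0.5 0.2] corresponds to
$A(z) = 1 + 0.6z^{-1} + 0.2z^{-2}$.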


Stability of all-pole systems. A causal LTI system is stable if all its poles are inside
the unit circle. For all-pole systems described by the denominator polynomial $A_P(z)$, this
implies that all its P roots are inside the unit circle, or alternatively, stability implies that
$A_P(z)$ is a minimum-phase polynomial. Numerical implementation of polynomial
root-finding operation is time-consuming. However, the following theorem shows how the lattice
coefficients $\{k_m\}_{m=1}^{P}$ can be used for stability purposes.


THEOREM 2.3. The polynomial
$$A_P(z) = 1 + a_1^{(P)} z^{-1} + \cdots + a_P^{(P)} z^{-P} \tag{2.5.36}$$
is minimum-phase, that is, has all its zeros inside the unit circle if and only if
$$|k_m| < 1 \qquad 1 \le m \le P \tag{2.5.37}$$

Proof. See Appendix E.


Therefore, if the lattice parameters $k_m$ in Figure 2.16 are less than unity in magnitude,
then the all-pole filter $H(z)$ in (2.5.35) is minimum-phase and stable since $A(z)$ is guaranteed
to have all its zeros inside the unit circle.
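Theorem 2.3 suggests a stability test that avoids root finding: run the backward recursion
(2.5.28) and stop as soon as some $|k_m| \ge 1$. The fragment below is a hedged sketch of
that test; the example polynomial a = [1 -1.2 0.5] and the variable names are assumptions,
and the call to roots is included only as a cross-check.

```matlab
% Minimum-phase (stability) test via the lattice parameters, Theorem 2.3.
% Illustrative sketch; example polynomial and names are assumptions.
a = [1 -1.2 0.5];              % example A(z), with a_0 = 1
c = a;
is_min_phase = true;
for m = numel(a)-1:-1:1
    km = c(end);               % k_m = a_m^{(m)}
    if abs(km) >= 1
        is_min_phase = false;  % condition (2.5.37) is violated
        break
    end
    c = (c - km*conj(fliplr(c))) / (1 - abs(km)^2);   % step down, (2.5.27)
    c = c(1:end-1);
end
roots_inside = all(abs(roots(a)) < 1);   % cross-check by root finding
disp([is_min_phase roots_inside])
```

For this example the recursion gives $k_2 = 0.5$ and $k_1 = -0.8$, so both flags are true.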

Since the AP lattice coefficients are derived from the same procedure used for the AZ
lattice filter, we can use the k = df2latcf(a) function in MATLAB. Care must be taken
to ignore the $k_0$ coefficient in the k array. Similarly, the a = latcf2df(k) function can be
used to convert the lattice $k_m$ coefficients to the direct-form coefficients $a_l$ provided that
$k_0 = 1$ is used as the first element of the k array.

All-pass lattice

The transfer function from $f_P(n)$ to $g_P(n)$ in Figure 2.16(b) can be written as
$$\frac{G_P(z)}{F_P(z)} = \frac{G_P(z)}{G_0(z)}\,\frac{F_0(z)}{F_P(z)} \tag{2.5.38}$$
where we used the fact that $F_0(z) = G_0(z)$. From (2.5.8) and (2.5.19), we conclude that
$$\frac{G_P(z)}{F_P(z)} = \frac{B_P(z)}{A_P(z)} = \frac{z^{-P} A^*(1/z^*)}{A(z)} = \frac{a_P^* + a_{P-1}^* z^{-1} + \cdots + z^{-P}}{1 + a_1 z^{-1} + \cdots + a_P z^{-P}} \tag{2.5.39}$$
which is the transfer function of an all-pass filter, since its magnitude on the unit circle is
unity at all frequencies.
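The all-pass property of (2.5.39) is easy to confirm numerically. The following sketch
evaluates the ratio of the conjugate reverse polynomial to $A(z)$ on the unit circle; the
example coefficients a = [1 0.6 0.2], the frequency grid, and the variable names are
assumptions made for illustration.

```matlab
% Magnitude of B_P(z)/A_P(z) on the unit circle, see (2.5.39).
% Illustrative sketch; example coefficients and names are assumptions.
a  = [1 0.6 0.2];              % example A(z) coefficients a_0..a_P
b  = conj(fliplr(a));          % conjugate reverse polynomial B_P(z)
w  = linspace(0, pi, 64);      % frequency grid in radians per sample
zi = exp(-1j*w);               % z^{-1} evaluated on the unit circle
% polyval expects the highest power first, so reverse the z^{-1} coefficients
H  = polyval(fliplr(b), zi) ./ polyval(fliplr(a), zi);
disp(max(abs(abs(H) - 1)))     % deviation of |H| from unity
```

The displayed deviation should be at the level of floating-point rounding.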


2.6 SUMMARY 


In this chapter we have reviewed the fundamental concepts of discrete-time signal process- 
ing in both the time and frequency domains. We introduced usual definitions and descriptions 
of signals, and we provided the analytical tools for linear system operations. Significant 
attention was also given to those topics that will be used extensively in the rest of the book. 
These topics include minimum-phase systems, inverse systems, and spectral factorization. 
Finally, filters, which will be used in the chapter on adaptive filters, were discussed in greater 
detail. It is important to grasp the material discussed in this chapter since it is fundamental 
to understanding concepts presented in the remaining chapters. Therefore, the reader should 
also consult any one of the widely used references on this subject (Proakis and Manolakis 
1996; Oppenheim and Schafer 1989). 


PROBLEMS 


2.1 A continuous-time signal $x_c(t)$ is sampled by an A/D converter to obtain the sequence $x(n)$. It
is processed by a digital filter $h(n) = 0.8^n u(n)$ to obtain the sequence $y(n)$, which is further
reconstructed using an ideal D/A converter to obtain the continuous-time output $y_c(t)$. The
sampling frequency of A/D and D/A converters is 100 sampling intervals per second.


(a) If $x_c(t) = 2\cos(40\pi t + \pi/3)$, what is the digital frequency $\omega_0$ in $x(n)$?

(b) If $x_c(t)$ is as given above, determine the steady-state response $y_{c,ss}(t)$.

(c) Determine two different $x_c(t)$ signals that would give the same steady-state response $y_{c,ss}(t)$
above.


2.2 Let $x(n)$ be a sinusoidal sequence of frequency $\omega_0$ and of finite length N, that is,
$$x(n) = \begin{cases} A\cos\omega_0 n & 0 \le n \le N-1 \\ 0 & \text{otherwise} \end{cases}$$
Thus $x(n)$ can be thought of as an infinite-length sinusoidal sequence multiplied by a rectangular
window of length N.


(a) If the DTFT of $x(n)$ is expressed in terms of the real and imaginary parts as
$$X(e^{j\omega}) \triangleq X_R(\omega) + jX_I(\omega)$$
determine analytical expressions for $X_R(\omega)$ and $X_I(\omega)$. Express $\cos\omega$ in terms of complex
exponentials and use the modulation property of the DTFT to arrive at the result.

(b) Choose N = 32 and $\omega_0 = \pi/4$, and plot $X_R(\omega)$ and $X_I(\omega)$ for $\omega \in [-\pi, \pi]$.

(c) Compute the 32-point DFT of $x(n)$, and plot its real and imaginary samples. Superimpose
the above DTFT plots on the DFT plots. Comment on the results.

(d) Repeat the above two parts for N = 32 and $\omega_0 = 1.1\pi/4$. Why are the plots so markedly
different?


2.3 Let $x(n) = \cos(3n/4)$, and assume that we have only 16 samples available for processing.


(a) Compute the 16-point DFT of these 16 samples, and plot their magnitudes. (Make sure that
this is a stem plot.)

(b) Now compute the 32-point DFT of the sequence formed by appending the above 16 samples
with 16 zero-valued samples. This is called zero padding. Now plot the magnitudes of the
DFT samples.

(c) Repeat part (b) for the 64-point sequence by padding 48 zero-valued samples.

(d) Explain the effect and hence the purpose of the zero padding operation on the DTFT
spectrum.


2.4 Let $x(n) = \{1, 2, 3, 4, 3, 2, 1\}$ and $h(n) = \{-1, 0, 1\}$.


(a) Determine the convolution $y(n) = x(n) * h(n)$ using the matrix-vector multiplication
approach given in (2.3.5).

(b) Develop a MATLAB function to implement the convolution using the Toeplitz matrix in
(2.3.4). The form of the function should be y = convtoep(x,h).

(c) Verify your function, using the sequences given in part (a) above.


2.5 Let $x(n) = (0.9)^n u(n)$.


(a) Determine $x(n) * x(n)$ analytically, and plot its first 101 samples.

(b) Truncate $x(n)$ to the first 51 samples. Compute and plot the convolution $x(n) * x(n)$, using
the conv function.

(c) Assume that $x(n)$ is the impulse response of an LTI system. Determine the filter function
coefficient vectors a and b. Using the filter function, compute and plot the first 101
samples of the convolution $x(n) * x(n)$.

(d) Comment on your plots. Which MATLAB approach is best suited for infinite-length sequences
and why?


2.6 Let $H_{ap}(z)$ be a causal and stable all-pass system excited by a causal input $x(n)$ producing the
response $y(n)$. Show that for any time $n_0$,
$$\sum_{n=0}^{n_0} |y(n)|^2 \le \sum_{n=0}^{n_0} |x(n)|^2 \tag{P.1}$$


2.7 This problem examines monotone phase-response property of a causal and stable PZ all-pass
system.


(a) Consider the pole-zero diagram of a real first-order all-pass system
$$H(z) = \frac{p - z^{-1}}{1 - pz^{-1}}$$
Show that its phase response decreases monotonically from $\pi$ (at $\omega = 0$) to $-\pi$ (at $\omega = 2\pi$).

(b) Consider the pole-zero diagram of a real second-order all-pass system
$$H(z) = \left[\frac{(r\angle\theta) - z^{-1}}{1 - (r\angle\theta)^* z^{-1}}\right]\left[\frac{(r\angle\theta)^* - z^{-1}}{1 - (r\angle\theta) z^{-1}}\right]$$
Show that its phase response decreases monotonically as $\omega$ increases from 0 to $\pi$.

(c) Generalize the results of parts (a) and (b) to show that the phase response of a causal and
stable PZ all-pass system decreases monotonically from $\angle[H(e^{j0})]$ to $\angle[H(e^{j0})] - 2\pi P$
as $\omega$ increases from 0 to $\pi$.


2.8 This problem explores the minimum group delay property of the minimum-phase systems.

(a) Consider the following stable minimum-, maximum-, and mixed-phase systems
$$H_{\min}(z) = (1 - 0.25z^{-1})(1 + 0.5z^{-1})$$
$$H_{\max}(z) = (0.25 - z^{-1})(0.5 + z^{-1})$$
$$H_{\text{mix}}(z) = (1 - 0.25z^{-1})(0.5 + z^{-1})$$
which have the same magnitude response. Compute and plot group delay responses. Observe
that the minimum-phase system has the minimum group delay.

(b) Using (2.4.18) and Problem 2.7, prove the minimum group delay property of the
minimum-phase systems.


2.9 Given the following spectral density functions, express them in minimum- and maximum-phase
components.

(a) $R_y(z) = \dfrac{1 - 2.5z^{-1} + z^{-2}}{1 - 2.05z^{-1} + z^{-2}}$

(b) $R_y(z) = \dfrac{3z^{2} - 10 + 3z^{-2}}{3z^{2} + 10 + 3z^{-2}}$

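2.10 Consider the all-pass system function $H_{ap}(z)$ given by
$$|a| < 1 \tag{P.2}$$

(a) Determine $|H_{ap}(z)|^2$ as a ratio of polynomials in z.

(b) Show that
$$D^2_{|H|}(z) - A^2_{|H|}(z) = (|z|^2 - 1)(1 - |a|^2) \qquad \text{where } |H_{ap}(z)|^2 = \frac{D^2_{|H|}(z)}{A^2_{|H|}(z)}$$

(c) Using $|a| < 1$ and the above result, show that
$$|H_{ap}(z)| \begin{cases} < 1 & \text{if } |z| < 1 \\ = 1 & \text{if } |z| = 1 \\ > 1 & \text{if } |z| > 1 \end{cases}$$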


2.11 Consider the system function of a stable system of the form
$$H(z) = \frac{a + bz^{-1} + cz^{-2}}{c + bz^{-1} + az^{-2}}$$

(a) Show that the magnitude of the frequency response function $|H(e^{j\omega})|$ is equal to 1 for all
frequencies, that is, it is an all-pass system.

(b) Let
$$H(z) = \frac{3 - 2z^{-1} + z^{-2}}{1 - 2z^{-1} + 3z^{-2}}$$
Determine both the magnitude and the phase of the frequency response $H(e^{j\omega})$, and plot
these functions over $[0, \pi]$.


2.12 Consider the system function of a third-order FIR system
$$H(z) = 12 + 28z^{-1} - 29z^{-2} - 60z^{-3}$$

(a) Determine the system functions of all other FIR systems whose magnitude responses are
identical to that of $H(z)$.

(b) Which of these systems is a minimum-phase system and which one is a maximum-phase
system?

(c) Let $h_k(n)$ denote the impulse response of the kth FIR system determined in part (a) and
define the energy delay of the kth system by
$$E_k(n) = \sum_{m=n}^{\infty} |h_k(m)|^2 \qquad 0 \le n \le 3$$
for all values of k. Show that
$$E_{\min}(n) \le E_k(n) \le E_{\max}(n) \qquad 0 \le n \le 3$$
and
$$E_{\min}(\infty) = E_k(\infty) = E_{\max}(\infty) = 0$$
where $E_{\min}(n)$ and $E_{\max}(n)$ are energy delays of the minimum-phase and maximum-phase
systems, respectively.


2.13 Consider the system function 


A(z) = 
1 


(a) Show that the system $H(z)$ is not minimum-phase.

(b) Construct a minimum-phase system $H_{\min}(z)$ such that $|H_{\min}(e^{j\omega})| = |H(e^{j\omega})|$.

(c) Is $H(z)$ a maximum-phase system? If yes, explain why. If not, then construct a
maximum-phase system $H_{\max}(z)$ such that $|H_{\max}(e^{j\omega})| = |H(e^{j\omega})|$.


2.14 Implement the following system as a parallel connection of two all-pass systems:
$$H(z) = \frac{3 + 9z^{-1} + 9z^{-2} + 3z^{-3}}{12 + 10z^{-1} + 2z^{-2}}$$


2.15 Determine the impulse response of an all-pole system with lattice parameters
$$k_1 = 0.2 \qquad k_2 = 0.3 \qquad k_3 = 0.5 \qquad k_4 = 0.7$$
Draw the direct- and lattice form structures of the above system.


So far we have dealt with deterministic signals, that is, signals whose amplitude is uniquely
specified by a mathematical formula or rule. However, there are many important examples 
of signals whose precise description (i.e., as deterministic signals) is extremely difficult, 
if not impossible. As mentioned in Section 2.1, such signals are called random signals. 
Although random signals are evolving in time in an unpredictable manner, their average 
properties can be often assumed to be deterministic; that is, they can be specified by explicit 
mathematical formulas. This is the key for the modeling of a random signal as a stochastic 
process. 

Our aim in the subsequent discussions is to present some basic results from the theory 
of random variables, random vectors, and discrete-time stochastic processes that will be 
useful in the chapters that follow. We assume that most readers have some basic knowledge 
of these topics, and so parts of this chapter may be treated as a review exercise. However, 
some specific topics are developed in greater depth with a viewpoint that will serve as a 
foundation for the rest of the book. A more complete treatment can be found in Papoulis 
(1991), Helstrom (1992), and Stark and Woods (1994). 


3.1 RANDOM VARIABLES 


The concept of random variables begins with the definition of probability. Consider an
experiment with a finite or infinite number of unpredictable outcomes from a universal set,
denoted by $S = \{\zeta_1, \zeta_2, \ldots\}$. A collection of subsets of S containing S itself and that is
closed under countable set operations is called a $\sigma$ field and denoted by $\mathcal{F}$. Elements of
$\mathcal{F}$ are called events. The unpredictability of these events is measured by a nonnegative set
function $\Pr\{\zeta_k\}$, k = 1, 2, ..., called the probability of event $\zeta_k$. This set function satisfies
three well-known and intuitive axioms (Papoulis 1991) such that the probability of any event
produced by set-theoretic operations on the events of S can be uniquely determined. Thus,
any situation of random nature, abstract or otherwise, can be studied using the axiomatic
definition of probability by defining an appropriate probability space $(S, \mathcal{F}, \Pr)$.

In practice it is often difficult, if not impossible, to work with this probability space for 
two reasons. First, the basic space contains abstract events and outcomes that are difficult to 
manipulate. In engineering applications, we want random outcomes that can be measured 
and manipulated in a meaningful way by using numerical operations. Second, the probability 
function Pr{-} is a set function that again is difficult, if not impossible, to manipulate by using 
calculus. These two problems are addressed through the concept of the random variable.

DEFINITION 3.1 (RANDOM VARIABLE). A random variable $x(\zeta)$ is a mapping that assigns
a real number x to every outcome $\zeta$ from an abstract probability space. This mapping should
satisfy the following two conditions: (1) the interval $\{x(\zeta) \le x\}$ is an event in the abstract
probability space for every x; (2) $\Pr\{x(\zeta) = \infty\} = 0$ and $\Pr\{x(\zeta) = -\infty\} = 0$.


A complex-valued random variable is defined by $x(\zeta) = x_R(\zeta) + jx_I(\zeta)$ where $x_R(\zeta)$
and $x_I(\zeta)$ are real-valued random variables. We will discuss complex-valued random
variables in Section 3.2. Strictly speaking, a random variable is neither random nor a variable
but is a function or a mapping. As shown in Figure 3.1, the domain of a random variable
is the universal set S, and its range is the real line $\mathbb{R}$. Since random variables are numbers,
they can be added, subtracted, or manipulated otherwise.


FIGURE 3.1
Graphical illustration of random variable mapping.


An important comment on notation. We will use $x(\zeta), y(\zeta), \ldots$, to denote random
variables and the corresponding lowercase alphabet without parentheses to denote their
values; for example, $x(\zeta) = x$ means that the random variable $x(\zeta)$ takes value equal to
x. We believe that this notation will not cause any confusion because the meaning of the
lowercase variable will be clear from the context.† A specific value of the random variable
realization will be denoted by $x(\zeta_0) = x_0$ (corresponding to a particular event $\zeta_0$ in the
original space).

A random variable is called discrete-valued if x takes a discrete set of values $\{x_k\}$;
otherwise, it is termed a continuous-valued random variable. A mixed random variable
takes both discrete and continuous values.


3.1.1 Distribution and Density Functions 


The probability set function $\Pr\{x(\zeta) \le x\}$ is a function of the set $\{x(\zeta) \le x\}$, but it is also
a number that varies with x. Hence it is also a function of a point x on the real line $\mathbb{R}$. This
point function is the well-known cumulative distribution function (cdf) $F_x(x)$ of a random
variable $x(\zeta)$ and is defined by
$$F_x(x) \triangleq \Pr\{x(\zeta) \le x\} \tag{3.1.1}$$
The second important probability function is the probability density function (pdf) $f_x(x)$,
which is defined as a formal derivative
$$f_x(x) \triangleq \frac{dF_x(x)}{dx} \tag{3.1.2}$$

† Traditionally, the uppercase alphabet is used to denote random variables. We have reserved the use of uppercase
alphabet for transform-domain quantities.

Note that the pdf $f_x(x)$ is not the probability, but must be multiplied by a certain interval
$\Delta x$ to obtain a probability, that is,
$$f_x(x)\,\Delta x \approx \Delta F_x(x) \triangleq F_x(x + \Delta x) - F_x(x) = \Pr\{x < x(\zeta) \le x + \Delta x\} \tag{3.1.3}$$
Integrating both sides of (3.1.2), we obtain
$$F_x(x) = \int_{-\infty}^{x} f_x(\nu)\, d\nu \tag{3.1.4}$$
For discrete-valued random variables, we use the probability mass function (pmf) $p_k$,
defined as the probability that random variable $x(\zeta)$ takes a value equal to $x_k$, or
$$p_k \triangleq \Pr\{x(\zeta) = x_k\} \tag{3.1.5}$$
These probability functions satisfy several important properties (Papoulis 1991), such
as
$$0 \le F_x(x) \le 1 \qquad F_x(-\infty) = 0 \qquad F_x(\infty) = 1 \tag{3.1.6}$$
$$f_x(x) \ge 0 \qquad \int_{-\infty}^{\infty} f_x(x)\, dx = 1 \tag{3.1.7}$$
Using these functions and their properties, we can compute the probabilities of any event
(or interval) on $\mathbb{R}$. For example,
$$\Pr\{x_1 < x(\zeta) \le x_2\} = F_x(x_2) - F_x(x_1) = \int_{x_1}^{x_2} f_x(x)\, dx \tag{3.1.8}$$
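The equivalence of the two expressions in (3.1.8) can be checked numerically. The sketch
below uses an exponential density, $f_x(x) = e^{-x}$ for $x \ge 0$, purely as an assumed example;
the interval endpoints, grid size, and variable names are likewise illustrative choices.

```matlab
% Interval probability from the cdf and from the pdf, see (3.1.8).
% Assumed example: exponential density f(x) = exp(-x) for x >= 0.
F  = @(x) 1 - exp(-x);          % cdf for x >= 0
f  = @(x) exp(-x);              % pdf for x >= 0
x1 = 0.5;  x2 = 2;              % assumed interval endpoints
p_cdf = F(x2) - F(x1);          % Pr{x1 < x <= x2} from the cdf
xx    = linspace(x1, x2, 2001); % integration grid
p_pdf = trapz(xx, f(xx));       % same probability by integrating the pdf
fprintf('%.6f  %.6f\n', p_cdf, p_pdf);
```

The two printed values agree to within the accuracy of the trapezoidal rule on this grid.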


3.1.2 Statistical Averages 


To completely characterize a random variable, we have to know its probability density 
function. In practice, it is desirable to summarize some of the key aspects of a density 
function by using a few numbers rather than to specify the entire density function. These 
numbers, which are called statistical averages or moments, are evaluated by using the 
mathematical expectation operation. Although density functions are needed to theoretically 
compute moments, in practice, moments are easily estimated without the explicit knowledge 
of density functions. 


Mathematical expectation 


This is one of the most important operations in the theory of random variables. It is
generally used to describe various statistical averages, and it is also needed in estimation
theory. The expected or mean value of a random variable $x(\zeta)$ is given by
$$E\{x(\zeta)\} \triangleq \mu_x = \begin{cases} \displaystyle\sum_k x_k p_k & x(\zeta) \text{ discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty} x f_x(x)\, dx & x(\zeta) \text{ continuous} \end{cases} \tag{3.1.9}$$
Although, strictly speaking, to compute $E\{x(\zeta)\}$ we need the definitions for both the discrete
and continuous random variables, we will follow the engineering practice of using the
expression for the continuous random variable (which can also describe a discrete random
variable if we allow impulse functions in its pdf). The expectation operation computes a
statistical average by using the density $f_x(x)$ as a weighting function. Hence, the mean $\mu_x$
can be regarded as the “location” (or the “center of gravity”) of the density $f_x(x)$, as shown
in Figure 3.2(a). If $f_x(x)$ is symmetric about x = a, then $\mu_x = a$ and, in particular, if $f_x(x)$
is an even function, then $\mu_x = 0$. One important property of expectation is that it is a linear
operation, that is,
$$E\{\alpha x(\zeta) + \beta\} = \alpha\mu_x + \beta \tag{3.1.10}$$
Let $y(\zeta) = g[x(\zeta)]$ be a random variable obtained by transforming $x(\zeta)$ through a suitable
function.* Then the expectation of $y(\zeta)$ is given by
$$E\{y(\zeta)\} \triangleq E\{g[x(\zeta)]\} = \int_{-\infty}^{\infty} g(x) f_x(x)\, dx \tag{3.1.11}$$

FIGURE 3.2
Illustration of mean, standard deviation, skewness, and kurtosis: (a) mean, (b) variance,
(c) skewness, (d) kurtosis.

Moments 


Using the expectation operations (3.1.9) and (3.1.11), we can define various moments
of the random variable $x(\zeta)$ that describe certain useful aspects of the density function. Let
$g[x(\zeta)] = x^m(\zeta)$. Then
$$r_x^{(m)} \triangleq E\{x^m(\zeta)\} = \int_{-\infty}^{\infty} x^m f_x(x)\, dx \tag{3.1.12}$$
is called the mth-order moment of $x(\zeta)$. In particular, $r_x^{(0)} = 1$, and the first-order moment
$r_x^{(1)} = \mu_x$. The second-order moment $r_x^{(2)} = E\{x^2(\zeta)\}$ is called the mean-squared value,
and it plays an important role in estimation theory. Note that
$$E\{x^2(\zeta)\} \ne E^2\{x(\zeta)\} \tag{3.1.13}$$
Corresponding to these moments we also have central moments. Let
$g[x(\zeta)] = [x(\zeta) - \mu_x]^m$, then
$$\gamma_x^{(m)} \triangleq E\{[x(\zeta) - \mu_x]^m\} = \int_{-\infty}^{\infty} (x - \mu_x)^m f_x(x)\, dx \tag{3.1.14}$$
is called the mth-order central moment of $x(\zeta)$. In particular, $\gamma_x^{(0)} = 1$ and $\gamma_x^{(1)} = 0$, which
is obvious. Clearly, a random variable’s moments and central moments are identical if its
mean value is zero. The second central moment is of considerable importance and is called
the variance of $x(\zeta)$, denoted by $\sigma_x^2$. Thus
$$\operatorname{var}[x(\zeta)] \triangleq \sigma_x^2 \triangleq \gamma_x^{(2)} = E\{[x(\zeta) - \mu_x]^2\} \tag{3.1.15}$$
The quantity $\sigma_x = \sqrt{\gamma_x^{(2)}}$ is called the standard deviation of $x(\zeta)$ and is a measure of
the spread (or dispersion) of the observed values of $x(\zeta)$ around its mean $\mu_x$ [see Figure
3.2(b)]. The relation between a random variable’s moments and central moments is given
by (see Problem 3.3)
$$\gamma_x^{(m)} = \sum_{k=0}^{m} \binom{m}{k} (-1)^k \mu_x^k\, r_x^{(m-k)} \tag{3.1.16}$$
In particular, and also from (3.1.15), we have
$$\sigma_x^2 = r_x^{(2)} - \mu_x^2 = E\{x^2(\zeta)\} - E^2\{x(\zeta)\} \tag{3.1.17}$$

* Such a function $g(\cdot)$ is called a Baire function (Papoulis 1991).


The quantity skewness is related to the third-order central moment and characterizes
the degree of asymmetry of a distribution around its mean, as shown in Figure 3.2(c). It is
defined as a normalized third-order central moment, that is,
$$\text{Skew} \triangleq \tilde{\kappa}_x^{(3)} \triangleq E\left\{\left[\frac{x(\zeta) - \mu_x}{\sigma_x}\right]^3\right\} = \frac{1}{\sigma_x^3}\,\gamma_x^{(3)} \tag{3.1.18}$$
and is a dimensionless quantity. It is a pure number that attempts to describe leaning of the
shape of the distribution. The skewness is zero if the density function is symmetric about its
mean value, is positive if the shape leans towards the right, or is negative if it leans towards
the left.

The quantity related to the fourth-order central moment is called kurtosis, which is also
a dimensionless quantity. It measures the relative flatness or peakedness of a distribution
about its mean as shown in Figure 3.2(d). This relative measure is with respect to a normal
distribution, which will be introduced in the next section. The kurtosis is defined as
$$\text{Kurtosis} \triangleq \tilde{\kappa}_x^{(4)} \triangleq E\left\{\left[\frac{x(\zeta) - \mu_x}{\sigma_x}\right]^4\right\} - 3 = \frac{1}{\sigma_x^4}\,\gamma_x^{(4)} - 3 \tag{3.1.19}$$
where the term $-3$ makes the kurtosis $\tilde{\kappa}_x^{(4)} = 0$ for the normal distribution [see (3.1.40) for
explanation].
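As a concrete illustration of (3.1.9) and (3.1.15) through (3.1.19), the following sketch
evaluates the four descriptors for a discrete random variable. The fair six-sided die used
here is an assumed example, as are the variable names.

```matlab
% Mean, variance, skewness, and kurtosis of a discrete random variable.
% Assumed example: a fair six-sided die, x_k = 1..6 with p_k = 1/6.
xk = 1:6;
pk = ones(1, 6) / 6;
mu    = sum(xk .* pk);                % mean, discrete form of (3.1.9)
r2    = sum(xk.^2 .* pk);             % mean-squared value
var_x = r2 - mu^2;                    % variance via (3.1.17)
sig   = sqrt(var_x);                  % standard deviation
g3    = sum((xk - mu).^3 .* pk);      % third central moment
g4    = sum((xk - mu).^4 .* pk);      % fourth central moment
skew  = g3 / sig^3;                   % skewness (3.1.18)
kurt  = g4 / sig^4 - 3;               % kurtosis (3.1.19)
fprintf('mean %.4f, var %.4f, skew %.4f, kurt %.4f\n', mu, var_x, skew, kurt);
```

Because this pmf is symmetric about its mean, the skewness is zero; the kurtosis is negative,
which corresponds to a distribution that is flatter than the normal one.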


Chebyshev’s inequality. A useful result in the interpretation and use of the mean $\mu_x$ and
the variance $\sigma_x^2$ of a random variable is given by Chebyshev’s inequality. Given a random
variable $x(\zeta)$ with its mean $\mu_x$ and variance $\sigma_x^2$, we have the inequality
$$\Pr\{|x(\zeta) - \mu_x| \ge k\sigma_x\} \le \frac{1}{k^2} \qquad k > 0 \tag{3.1.20}$$
The interpretation of the above inequality is that regardless of the shape of $f_x(x)$, the random
variable $x(\zeta)$ deviates from its mean by at least k times its standard deviation with probability
less than or equal to $1/k^2$.
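The bound (3.1.20) can be compared with exact tail probabilities for any distribution whose
pmf or pdf is known. The sketch below does this for a fair six-sided die; the pmf, the
tested values of k, and the variable names are assumptions chosen for illustration.

```matlab
% Check Chebyshev's inequality (3.1.20) for an assumed example pmf
% (fair six-sided die) at a few values of k.
xk = 1:6;
pk = ones(1, 6) / 6;
mu  = sum(xk .* pk);                         % mean
sig = sqrt(sum((xk - mu).^2 .* pk));         % standard deviation
kk  = [1.2 1.5 2 3];                         % assumed test values of k
p_tail = zeros(size(kk));
for i = 1:numel(kk)
    % exact Pr{|x - mu| >= k*sigma} from the pmf
    p_tail(i) = sum(pk(abs(xk - mu) >= kk(i)*sig));
end
bound = 1 ./ kk.^2;                          % Chebyshev bound 1/k^2
disp([kk; p_tail; bound]);
```

In each column the exact tail probability (second row) does not exceed the bound (third
row), and for this pmf the bound is far from tight.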


Characteristic functions 


The Fourier and Laplace transforms find many uses in probability theory through the
concepts of characteristic and moment generating functions. The characteristic function of
a random variable $x(\zeta)$ is defined by the integral
$$\Phi_x(\xi) \triangleq E\{e^{j\xi x(\zeta)}\} = \int_{-\infty}^{\infty} f_x(x)\, e^{j\xi x}\, dx \tag{3.1.21}$$
which can be interpreted as the Fourier transform of $f_x(x)$ with sign reversal in the complex
exponential. To avoid confusion with the cdf, we do not use $F_x(\xi)$ to denote this Fourier
transform. Furthermore, the variable $\xi$ in $\Phi_x(\xi)$ is not and should not be interpreted as
frequency. When $j\xi$ in (3.1.21) is replaced by a complex variable s, we obtain the moment
generating function defined by
$$\bar{\Phi}_x(s) \triangleq E\{e^{sx(\zeta)}\} = \int_{-\infty}^{\infty} f_x(x)\, e^{sx}\, dx \tag{3.1.22}$$
which again can be interpreted as the Laplace transform of $f_x(x)$ with sign reversal.
Expanding $e^{sx}$ in (3.1.22) in a Taylor series at s = 0, we obtain
$$\bar{\Phi}_x(s) = E\{e^{sx(\zeta)}\} = E\left\{1 + sx(\zeta) + \frac{[sx(\zeta)]^2}{2!} + \cdots + \frac{[sx(\zeta)]^m}{m!} + \cdots\right\}$$
$$\phantom{\bar{\Phi}_x(s)} = 1 + s\mu_x + \frac{s^2}{2!}\, r_x^{(2)} + \cdots + \frac{s^m}{m!}\, r_x^{(m)} + \cdots \tag{3.1.23}$$
provided every moment $r_x^{(m)}$ exists. Thus from (3.1.23) we infer that if all moments of $x(\zeta)$
are known (and exist), then we can assemble $\bar{\Phi}_x(s)$ and upon inverse Laplace
transformation, we can determine the density function $f_x(x)$. If we differentiate $\bar{\Phi}_x(s)$ with respect to
s, we obtain
$$r_x^{(m)} = \left.\frac{d^m[\bar{\Phi}_x(s)]}{ds^m}\right|_{s=0} = (-j)^m \left.\frac{d^m[\Phi_x(\xi)]}{d\xi^m}\right|_{\xi=0} \qquad m = 1, 2, \ldots \tag{3.1.24}$$
which provides the mth-order moment of the random variable $x(\zeta)$.
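A short worked example may make (3.1.24) concrete. It assumes an exponential density,
$f_x(x) = e^{-x}$ for $x \ge 0$ and zero elsewhere, which is not discussed in the text and is used
here only for illustration; $\bar{\Phi}_x(s)$ denotes the moment generating function of (3.1.22)
and $\Phi_x(\xi)$ the characteristic function of (3.1.21).

```latex
% Assumed example: f_x(x) = e^{-x} for x >= 0 and zero elsewhere.
\begin{align*}
\bar{\Phi}_x(s) &= \int_{0}^{\infty} e^{-x} e^{sx}\,dx = \frac{1}{1-s},
  \qquad \operatorname{Re}(s) < 1 \\
r_x^{(m)} &= \left.\frac{d^m}{ds^m}\,\frac{1}{1-s}\right|_{s=0}
  = \left.\frac{m!}{(1-s)^{m+1}}\right|_{s=0} = m! \\
\Phi_x(\xi) &= \frac{1}{1-j\xi}, \qquad
(-j)\left.\frac{d\Phi_x(\xi)}{d\xi}\right|_{\xi=0}
  = (-j)\left.\frac{j}{(1-j\xi)^{2}}\right|_{\xi=0} = 1 = r_x^{(1)}
\end{align*}
```

Both routes give the mean $\mu_x = r_x^{(1)} = 1$, and the first route gives $r_x^{(2)} = 2$, so that
$\sigma_x^2 = r_x^{(2)} - \mu_x^2 = 1$ by (3.1.17).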

The functions $\Phi_x(\xi)$ and $\bar{\Phi}_x(s)$ possess all the properties associated with the Fourier
and Laplace transforms, respectively. Thus, since $f_x(x)$ is always a real-valued function,
$\Phi_x(\xi)$ is conjugate symmetric; and if $f_x(x)$ is also an even function, then $\Phi_x(\xi)$ is a
real-valued even function. In addition, they possess several properties due to the basic nature of
the pdf. Therefore, the characteristic function $\Phi_x(\xi)$ always exists† since
$$\int_{-\infty}^{\infty} |f_x(x)\, e^{j\xi x}|\, dx = \int_{-\infty}^{\infty} f_x(x)\, dx = 1$$
and $\Phi_x(\xi)$ is maximum at the origin, that is,
$$|\Phi_x(\xi)| \le \Phi_x(0) = 1 \tag{3.1.25}$$
since $f_x(x) \ge 0$.


Cumulants 


These statistical descriptors are similar to the moments, but provide better information
for higher-order moment analysis, which we will consider in detail in Chapter 12. The
cumulants are derived by considering the moment generating function’s natural logarithm.
This logarithm is commonly referred to as the cumulant generating function and is given
by
$$\bar{\Psi}_x(s) \triangleq \ln \bar{\Phi}_x(s) = \ln E\{e^{sx(\zeta)}\} \tag{3.1.26}$$
When s is replaced by $j\xi$ in (3.1.26), the resulting function is known as the second
characteristic function and is denoted by $\Psi_x(\xi)$.
The cumulants $\kappa_x^{(m)}$ of a random variable $x(\zeta)$ are defined as the derivatives of the
cumulant generating function, that is,
$$\kappa_x^{(m)} \triangleq \left.\frac{d^m[\bar{\Psi}_x(s)]}{ds^m}\right|_{s=0} = (-j)^m \left.\frac{d^m[\Psi_x(\xi)]}{d\xi^m}\right|_{\xi=0} \qquad m = 1, 2, \ldots \tag{3.1.27}$$
Clearly, $\kappa_x^{(0)} = 0$. It can be shown that (see Problem 3.4) for a zero-mean random variable,
the first five cumulants as functions of the central moments are given by
$$\kappa_x^{(1)} = \mu_x = 0 \tag{3.1.28}$$
$$\kappa_x^{(2)} = \gamma_x^{(2)} = \sigma_x^2 \tag{3.1.29}$$
$$\kappa_x^{(3)} = \gamma_x^{(3)} \tag{3.1.30}$$
$$\kappa_x^{(4)} = \gamma_x^{(4)} - 3\sigma_x^4 \tag{3.1.31}$$
$$\kappa_x^{(5)} = \gamma_x^{(5)} - 10\gamma_x^{(3)}\sigma_x^2 \tag{3.1.32}$$
which show that the first two cumulants are identical to the first two central moments.
Clearly due to the logarithmic function in (3.1.26), cumulants are useful for dealing with
products of characteristic functions (see Section 3.2.4).

† We will generally choose the characteristic function over the moment generating function.


3.1.3 Some Useful Random Variables 


Random variable models are needed to describe (or approximate) complex physical phe- 
nomena using simple parameters. For example, the random phase of a sinusoidal carrier can 
be described by a uniformly distributed random variable so that we can study its statistical 
properties. This approximation allows us to investigate random signals in a sound mathe- 
matical way. We will describe three continuous random variable models although there are 
several other known continuous as well as discrete models available in the literature. 


Uniformly distributed random variable. This is an appropriate model in situations in
which random outcomes are “equally likely.” Here $x(\zeta)$ assumes values on $\mathbb{R}$ according to
the pdf
$$f_x(x) = \begin{cases} \dfrac{1}{b-a} & a \le x \le b \\[1.5ex] 0 & \text{elsewhere} \end{cases} \tag{3.1.33}$$
where a < b are specified parameters. This pdf is shown in Figure 3.3.

FIGURE 3.3
Probability density functions of useful random variables.

The corresponding cdf is given by
$$F_x(x) = \int_{-\infty}^{x} f_x(\nu)\, d\nu = \begin{cases} 0 & x < a \\[0.5ex] \dfrac{x-a}{b-a} & a \le x \le b \\[1.5ex] 1 & x > b \end{cases} \tag{3.1.34}$$
and the characteristic function is given by
$$\Phi_x(\xi) = \frac{e^{j\xi b} - e^{j\xi a}}{j\xi(b-a)} \tag{3.1.35}$$
The mean and the variance of this random variable are given by, respectively,
$$\mu_x = \frac{a+b}{2} \qquad \text{and} \qquad \sigma_x^2 = \frac{(b-a)^2}{12} \tag{3.1.36}$$
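The closed-form mean and variance of the uniform density can be confirmed by integrating
the pdf numerically. The interval endpoints a = -1 and b = 3, the grid size, and the variable
names in this sketch are assumptions made for illustration.

```matlab
% Numerical check of the uniform mean and variance (3.1.36)
% for an assumed interval [a, b].
a = -1;  b = 3;
x  = linspace(a, b, 4001);                % integration grid on [a, b]
fx = ones(size(x)) / (b - a);             % uniform pdf (3.1.33) on [a, b]
mu    = trapz(x, x .* fx);                % mean by numerical integration
var_x = trapz(x, (x - mu).^2 .* fx);      % variance by numerical integration
fprintf('mean %.6f (formula %.6f)\n', mu, (a + b)/2);
fprintf('var  %.6f (formula %.6f)\n', var_x, (b - a)^2/12);
```

The numerical values agree with $(a+b)/2 = 1$ and $(b-a)^2/12 = 4/3$ to within the accuracy of
the trapezoidal rule.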


Normal random variable. This is the most useful and convenient model in many
applications, as we shall see later. It is also known as a Gaussian random variable, and we will
use both terms interchangeably. The pdf of a normally distributed random variable $x(\zeta)$
with mean $\mu_x$ and standard deviation $\sigma_x$ is given by

$$f_x(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu_x}{\sigma_x}\right)^2\right] \tag{3.1.37}$$


where $-\infty < \mu_x < \infty$ and $\sigma_x > 0$ (see Figure 3.3). The characteristic function of the normal
random variable is given by

$$\Phi_x(\xi) = \exp\left(j\mu_x\xi - \tfrac{1}{2}\sigma_x^2\xi^2\right) \tag{3.1.38}$$

Clearly, the pdf of a normal random variable is completely described by its mean $\mu_x$ and
standard deviation $\sigma_x$ and is denoted by $\mathcal{N}(\mu_x, \sigma_x^2)$. We note that all higher-order moments
of a normal random variable can be determined in terms of the first two moments, that is,

$$\gamma_x^{(m)} = E\{[x(\zeta) - \mu_x]^m\} = \begin{cases} 1\cdot 3\cdot 5\cdots(m-1)\,\sigma_x^m & \text{if } m \text{ even} \\ 0 & \text{if } m \text{ odd} \end{cases} \tag{3.1.39}$$


In particular, we obtain the fourth moment as

$$\gamma_x^{(4)} = 3\sigma_x^4 \tag{3.1.40}$$

or from (3.1.19), kurtosis = 0, which explains the term $-3$ in (3.1.19).

From (3.1.37), we observe that the Gaussian random variable is completely determined
by its first two moments (mean $\mu_x$ and variance $\sigma_x^2$), which means that the higher moments
do not provide any additional information about the Gaussian density function. In fact, all
higher-order moments can be obtained in terms of the first two moments [see Equation
(3.1.39)]. Thus for a non-Gaussian random variable, we would like to know how different
that random variable is from a Gaussian random variable (this is also known as a departure
from Gaussianity). This deviation from being Gaussian is measured by
the cumulants that were defined in (3.1.27). Roughly speaking, just as central moments
measure deviations from the mean, cumulants of order higher than 2 measure deviations
from the Gaussian. Indeed, substituting (3.1.39) and (3.1.40) into (3.1.30) and (3.1.31) gives
zero, and more generally all higher-order (that is, $m > 2$) cumulants of a Gaussian random
variable are zero. This fact is used in the
analysis and estimation of non-Gaussian random variables (and later for non-Gaussian
random processes).
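To make the role of the fourth cumulant concrete, the following sketch estimates $\kappa_x^{(4)} = \gamma_x^{(4)} - 3\sigma_x^4$ from (3.1.31) for Gaussian and for uniform pseudo random samples. It is only an illustration under stated assumptions: Python (standard library) is used for convenience, and the helper name `fourth_cumulant`, the seed, and the sample size are choices made here, not part of the text.

```python
import random

def fourth_cumulant(samples):
    """Sample estimate of kappa4 = gamma4 - 3*sigma^4 using central moments."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    var = sum(c * c for c in centered) / n        # second central moment
    gamma4 = sum(c ** 4 for c in centered) / n    # fourth central moment
    return gamma4 - 3.0 * var * var

random.seed(1)      # arbitrary seed, used only for repeatability
N = 200_000         # assumed sample size

k4_gauss = fourth_cumulant([random.gauss(0.0, 1.0) for _ in range(N)])
k4_unif = fourth_cumulant([random.uniform(-0.5, 0.5) for _ in range(N)])

# Theory: 0 for the Gaussian; 1/80 - 3/144 = -1/120 for uniform on [-0.5, 0.5].
print(f"Gaussian kappa4 estimate: {k4_gauss:+.4f}")
print(f"uniform  kappa4 estimate: {k4_unif:+.5f} (theory {-1 / 120:+.5f})")
```

The Gaussian estimate should hover near zero, while the uniform estimate stays clearly negative, which is the kind of departure from Gaussianity that the cumulants are meant to expose.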


Cauchy random variable. This is an appropriate model in which a random variable
takes large values with significant probability (heavy-tailed distribution). The Cauchy pdf
with parameters $\mu$ and $\beta$ is given by

$$f_x(x) = \frac{\beta}{\pi}\,\frac{1}{(x-\mu)^2 + \beta^2} \tag{3.1.41}$$

and is shown in Figure 3.3. The corresponding cdf is given by

$$F_x(x) = 0.5 + \frac{1}{\pi}\arctan\frac{x-\mu}{\beta} \tag{3.1.42}$$

and the characteristic function is given by

$$\Phi_x(\xi) = \exp(j\mu\xi - \beta|\xi|) \tag{3.1.43}$$


The Cauchy random variable has mean $\mu_x = \mu$ (in the principal-value sense, since the
integral defining the mean is not absolutely convergent). However, its variance does not exist
because $E\{x^2(\zeta)\}$ fails to exist in any sense, and hence the moment generating function does
not exist, in general. It has the property that the sum of $M$ independent Cauchy random
variables is also Cauchy (see Example 3.2.3). Thus a Cauchy random variable is an example
of an infinite-variance random variable.


Random number generators. Random numbers, by definition, are truly unpredictable, 
and hence it is not possible to generate them by using a well-defined algorithm on a computer. 
However, in many simulation studies, we need to use sequences of numbers that appear to 
be random and that possess required properties, for example, Gaussian random numbers 
in a Monte Carlo analysis. These numbers are called pseudo random numbers, and many 
excellent algorithms are available to generate them on a computer (Park and Miller 1988). 
In MATLAB, the function rand generates numbers that are uniformly distributed over (0, 1)
while the function randn generates $\mathcal{N}(0, 1)$ pseudo random numbers.
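For readers working outside MATLAB, a rough analog can be sketched with Python's standard `random` module, where `random.random()` plays the role of rand and `random.gauss(0, 1)` the role of randn. The seed, the sample size, and the helper `mean_var` are assumptions of this sketch.

```python
import random

def mean_var(samples):
    """Return the sample mean and the (biased) sample variance."""
    n = len(samples)
    m = sum(samples) / n
    v = sum((s - m) ** 2 for s in samples) / n
    return m, v

random.seed(12345)   # arbitrary seed, used only to make the run repeatable
N = 100_000          # assumed sample size

u = [random.random() for _ in range(N)]          # like rand: uniform on [0, 1)
g = [random.gauss(0.0, 1.0) for _ in range(N)]   # like randn: N(0, 1)

mean_u, var_u = mean_var(u)
mean_g, var_g = mean_var(g)

# Theory: uniform(0, 1) has mean 1/2 and variance 1/12; N(0, 1) has mean 0, variance 1.
print(f"uniform: mean={mean_u:.4f} var={var_u:.4f}")
print(f"normal : mean={mean_g:.4f} var={var_g:.4f}")
```

Fixing the seed reproduces the same sequence on every run, which is exactly what makes these numbers pseudo random rather than random.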


3.2 RANDOM VECTORS 


In many applications, a group of signal observations can be modeled as a collection of 
random variables that can be grouped to form a random vector. This is an extension of the 
concept of random variable and generalizes many scalar quantities to vectors and matrices. 
One example of a random vector is the case of a complex-valued random variable $x(\zeta) =
x_R(\zeta) + j x_I(\zeta)$, which can be considered as a group of $x_R(\zeta)$ and $x_I(\zeta)$. In this section, we
provide a review of the basic properties of random vectors and related results from linear 
algebra. We first begin with real-valued random vectors and then extend their concepts to 
complex-valued random vectors. 


3.2.1 Definitions and Second-Order Moments 


A real-valued vector containing $M$ random variables

$$\mathbf{x}(\zeta) = [x_1(\zeta), x_2(\zeta), \ldots, x_M(\zeta)]^T \tag{3.2.1}$$

is called a random M vector or a random vector when dimensionality is unimportant. As
usual, superscript $T$ denotes the transpose of the vector. We can think of a real-valued
random vector as a mapping from an abstract probability space to a vector-valued, real
space $\mathbb{R}^M$. Thus the range of this mapping is an M-dimensional space.

Distribution and density functions 

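A random vector is completely characterized by its joint cumulative distribution
function, which is defined by

$$F_{\mathbf{x}}(x_1, \ldots, x_M) = \Pr\{x_1(\zeta) \le x_1, \ldots, x_M(\zeta) \le x_M\} \tag{3.2.2}$$

and is often written as

$$F_{\mathbf{x}}(\mathbf{x}) = \Pr\{\mathbf{x}(\zeta) \le \mathbf{x}\} \tag{3.2.3}$$

for convenience. A random vector can be also characterized by its joint probability density
function, which is defined by

$$f_{\mathbf{x}}(\mathbf{x}) = \lim_{\substack{\Delta x_1 \to 0 \\ \vdots \\ \Delta x_M \to 0}} \frac{\Pr\{x_1 < x_1(\zeta) \le x_1 + \Delta x_1, \ldots, x_M < x_M(\zeta) \le x_M + \Delta x_M\}}{\Delta x_1 \cdots \Delta x_M} = \frac{\partial}{\partial x_1} \cdots \frac{\partial}{\partial x_M} F_{\mathbf{x}}(\mathbf{x}) \tag{3.2.4}$$

The function

$$f_{x_j}(x_j) = \underbrace{\int \cdots \int}_{(M-1)} f_{\mathbf{x}}(\mathbf{x})\, dx_1 \cdots dx_{j-1}\, dx_{j+1} \cdots dx_M \tag{3.2.5}$$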


is known as a marginal density function and describes individual random variables. Thus 
the probability functions defined for a random variable in the previous section are more 
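appropriately called marginal functions. The joint pdf $f_{\mathbf{x}}(\mathbf{x})$ must be multiplied by the
volume of a small M-dimensional region $\Delta\mathbf{x}$ to obtain a probability. From (3.2.4) we obtain

$$F_{\mathbf{x}}(\mathbf{x}) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_M} f_{\mathbf{x}}(\mathbf{v})\, dv_1 \cdots dv_M = \int_{-\infty}^{\mathbf{x}} f_{\mathbf{x}}(\mathbf{v})\, d\mathbf{v} \tag{3.2.6}$$

These joint probability functions also satisfy several important properties that are similar to
(3.1.6) through (3.1.8) for random variables. In particular, note that both $f_{\mathbf{x}}(\mathbf{x})$ and $F_{\mathbf{x}}(\mathbf{x})$
are nonnegative multidimensional functions.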

The joint [and conditional probability (see Papoulis 1991)] functions can also be used
to define the concept of independent random variables. Two random variables $x_1(\zeta)$ and
$x_2(\zeta)$ are independent if the events $\{x_1(\zeta) \le x_1\}$ and $\{x_2(\zeta) \le x_2\}$ are jointly independent,
that is, if

$$\Pr\{x_1(\zeta) \le x_1, x_2(\zeta) \le x_2\} = \Pr\{x_1(\zeta) \le x_1\} \Pr\{x_2(\zeta) \le x_2\}$$

which implies that

$$F_{x_1,x_2}(x_1, x_2) = F_{x_1}(x_1) F_{x_2}(x_2) \qquad \text{and} \qquad f_{x_1,x_2}(x_1, x_2) = f_{x_1}(x_1) f_{x_2}(x_2) \tag{3.2.7}$$


Complex-valued random variables and vectors 


As we shall see in later chapters, in applications such as channel equalization, array 
processing, etc., we encounter complex signal and noise models. To formulate these models, 
we need to describe complex random variables and vectors, and then extend our standard 
definitions and results to the complex case. A complex random variable is defined as¹
$x(\zeta) = x_R(\zeta) + j x_I(\zeta)$, where $x_R(\zeta)$ and $x_I(\zeta)$ are real-valued random variables. Thus we
can think of $x(\zeta)$ as a mapping from an abstract probability space $S$ to a complex space $\mathbb{C}$.
Alternatively, $x(\zeta)$ can be thought of as a real-valued random vector $[x_R(\zeta), x_I(\zeta)]^T$ with a
joint cdf $F_{x_R,x_I}(x_R, x_I)$ or a joint pdf $f_{x_R,x_I}(x_R, x_I)$ that will allow us to define its statistical
averages. The mean of $x(\zeta)$ is defined as

$$E\{x(\zeta)\} = \mu_x = E\{x_R(\zeta) + j x_I(\zeta)\} = \mu_{x_R} + j\mu_{x_I} \tag{3.2.8}$$

and the variance is defined as

$$\sigma_x^2 = E\{|x(\zeta) - \mu_x|^2\} \tag{3.2.9}$$

which can be shown to be equal to

$$\sigma_x^2 = E\{|x(\zeta)|^2\} - |\mu_x|^2 \tag{3.2.10}$$

¹ We will not make any distinction in notation between a real-valued and a complex-valued random variable. The
actual type should be evident from the context.
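The equivalence of (3.2.9) and (3.2.10) can be checked numerically. The sketch below is illustrative only: it draws complex samples with Python's standard `random` module and evaluates both expressions on sample averages. The mean $1 - 0.5j$, the standard deviations of the real and imaginary parts, the seed, and the sample size are all assumed values.

```python
import random

random.seed(7)   # arbitrary seed for repeatability
N = 50_000       # assumed sample size

# Real part ~ N(1, 2^2), imaginary part ~ N(-0.5, 1^2), independent of each other.
x = [complex(1.0 + random.gauss(0.0, 2.0), -0.5 + random.gauss(0.0, 1.0))
     for _ in range(N)]

mu = sum(x) / N                                               # sample mean, cf. (3.2.8)
var_direct = sum(abs(v - mu) ** 2 for v in x) / N             # cf. (3.2.9)
var_moment = sum(abs(v) ** 2 for v in x) / N - abs(mu) ** 2   # cf. (3.2.10)

# Both estimates agree up to rounding; the true variance is 2^2 + 1^2 = 5.
print(f"direct: {var_direct:.4f}  via second moment: {var_moment:.4f}")
```

Note that the variance of a complex random variable is real and equals the sum of the variances of its real and imaginary parts.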


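A complex-valued random vector is given by

$$\mathbf{x}(\zeta) = \mathbf{x}_R(\zeta) + j\mathbf{x}_I(\zeta) = \begin{bmatrix} x_{R1}(\zeta) \\ \vdots \\ x_{RM}(\zeta) \end{bmatrix} + j \begin{bmatrix} x_{I1}(\zeta) \\ \vdots \\ x_{IM}(\zeta) \end{bmatrix} \tag{3.2.11}$$

and we can think of a complex-valued random vector as a mapping from an abstract
probability space to a vector-valued, complex space $\mathbb{C}^M$. The cdf for the complex-valued random
vector $\mathbf{x}(\zeta)$ is then defined as

$$F_{\mathbf{x}}(\mathbf{x}) = \Pr\{\mathbf{x}(\zeta) \le \mathbf{x}\} \triangleq \Pr\{\mathbf{x}_R(\zeta) \le \mathbf{x}_R, \mathbf{x}_I(\zeta) \le \mathbf{x}_I\} \tag{3.2.12}$$

while its joint pdf is defined as

$$f_{\mathbf{x}}(\mathbf{x}) = \lim_{\substack{\Delta x_{R1} \to 0 \\ \vdots \\ \Delta x_{IM} \to 0}} \frac{\Pr\{\mathbf{x}_R < \mathbf{x}_R(\zeta) \le \mathbf{x}_R + \Delta\mathbf{x}_R, \mathbf{x}_I < \mathbf{x}_I(\zeta) \le \mathbf{x}_I + \Delta\mathbf{x}_I\}}{\Delta x_{R1}\, \Delta x_{I1} \cdots \Delta x_{RM}\, \Delta x_{IM}} = \frac{\partial}{\partial x_{R1}} \frac{\partial}{\partial x_{I1}} \cdots \frac{\partial}{\partial x_{RM}} \frac{\partial}{\partial x_{IM}} F_{\mathbf{x}}(\mathbf{x}) \tag{3.2.13}$$

From (3.2.13), the cdf is obtained by integrating the pdf over all real and imaginary parts,
that is

$$F_{\mathbf{x}}(\mathbf{x}) = \int_{-\infty}^{x_{R1}} \cdots \int_{-\infty}^{x_{IM}} f_{\mathbf{x}}(\mathbf{v})\, dv_{R1} \cdots dv_{IM} = \int_{-\infty}^{\mathbf{x}} f_{\mathbf{x}}(\mathbf{v})\, d\mathbf{v} \tag{3.2.14}$$

where the single integral in the last expression is used as a compact notation for
multidimensional integrals and should not be confused with a complex-contour integral. These
probability functions for a complex-valued random vector possess properties similar to
those of the real-valued random vectors. In particular,

$$\int_{-\infty}^{\infty} f_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x} = 1 \tag{3.2.15}$$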


Statistical description 


Clearly the above probability functions require an enormous amount of information 
that is not easy to obtain or is too complex mathematically for practical use. In practical 
applications, random vectors are described by less complete but more manageable statistical 
averages. 


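Mean vector. As we have seen before, the most important statistical operation is the
expectation operation. The marginal expectation of a random vector $\mathbf{x}(\zeta)$ is called the mean
vector and is defined by

$$\boldsymbol{\mu}_x = E\{\mathbf{x}(\zeta)\} = \begin{bmatrix} E\{x_1(\zeta)\} \\ \vdots \\ E\{x_M(\zeta)\} \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_M \end{bmatrix} \tag{3.2.16}$$

where the integral implied by the expectation is taken over the entire $\mathbb{C}^M$ space. The
components of $\boldsymbol{\mu}_x$ are the means of the individual random variables.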


Correlation and covariance matrices. The second-order moments of a random vector
$\mathbf{x}(\zeta)$ are given as matrices and describe the spread of its distribution. The autocorrelation
matrix is defined by

$$\mathbf{R}_x \triangleq E\{\mathbf{x}(\zeta)\mathbf{x}^H(\zeta)\} = \begin{bmatrix} r_{11} & \cdots & r_{1M} \\ \vdots & \ddots & \vdots \\ r_{M1} & \cdots & r_{MM} \end{bmatrix} \tag{3.2.17}$$

where superscript $H$ denotes the conjugate transpose operation, the diagonal terms

$$r_{ii} \triangleq E\{|x_i(\zeta)|^2\} \qquad i = 1, \ldots, M \tag{3.2.18}$$

are the second-order moments, denoted earlier as $r_{x_i}^{(2)}$, of random variables $x_i(\zeta)$, and the
off-diagonal terms

$$r_{ij} \triangleq E\{x_i(\zeta)x_j^*(\zeta)\} = r_{ji}^* \qquad i \ne j \tag{3.2.19}$$

measure the correlation, that is, the statistical similarity between the random variables $x_i(\zeta)$
and $x_j(\zeta)$. From (3.2.19) we note that the correlation matrix $\mathbf{R}_x$ is conjugate symmetric or
Hermitian, that is, $\mathbf{R}_x = \mathbf{R}_x^H$.

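The autocovariance matrix is defined by

$$\boldsymbol{\Gamma}_x \triangleq E\{[\mathbf{x}(\zeta) - \boldsymbol{\mu}_x][\mathbf{x}(\zeta) - \boldsymbol{\mu}_x]^H\} = \begin{bmatrix} \gamma_{11} & \cdots & \gamma_{1M} \\ \vdots & \ddots & \vdots \\ \gamma_{M1} & \cdots & \gamma_{MM} \end{bmatrix} \tag{3.2.20}$$

where the diagonal terms

$$\gamma_{ii} = E\{|x_i(\zeta) - \mu_i|^2\} \qquad i = 1, \ldots, M \tag{3.2.21}$$

are the (self-)variances of $x_i(\zeta)$ denoted earlier as $\sigma_{x_i}^2$, while the off-diagonal terms

$$\gamma_{ij} = E\{[x_i(\zeta) - \mu_i][x_j(\zeta) - \mu_j]^*\} = E\{x_i(\zeta)x_j^*(\zeta)\} - \mu_i\mu_j^* = \gamma_{ji}^* \qquad i \ne j \tag{3.2.22}$$

are the values of the covariance between $x_i(\zeta)$ and $x_j(\zeta)$. The covariance matrix $\boldsymbol{\Gamma}_x$ is
also a Hermitian matrix. The covariance $\gamma_{ij}$ can also be expressed in terms of standard
deviations of $x_i(\zeta)$ and $x_j(\zeta)$ as $\gamma_{ij} = \rho_{ij}\sigma_i\sigma_j$, where

$$\rho_{ij} \triangleq \frac{\gamma_{ij}}{\sigma_i\sigma_j} = \rho_{ji}^* \tag{3.2.23}$$

is called the correlation coefficient between $x_i(\zeta)$ and $x_j(\zeta)$. Note that

$$|\rho_{ij}| \le 1 \quad i \ne j \qquad \rho_{ii} = 1 \tag{3.2.24}$$

The correlation coefficient measures the degree of statistical similarity between two random
variables. If $|\rho_{ij}| = 1$, then random variables are said to be perfectly correlated; but if
$\rho_{ij} = 0$ (that is, when the covariance $\gamma_{ij} = 0$), then $x_i(\zeta)$ and $x_j(\zeta)$ are said to be uncorrelated.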
The autocorrelation and autocovariance matrices are related. Indeed, we can easily see
that

$$\boldsymbol{\Gamma}_x \triangleq E\{[\mathbf{x}(\zeta) - \boldsymbol{\mu}_x][\mathbf{x}(\zeta) - \boldsymbol{\mu}_x]^H\} = \mathbf{R}_x - \boldsymbol{\mu}_x\boldsymbol{\mu}_x^H \tag{3.2.25}$$

which shows that these two moments have essentially the same amount of information. In
fact, if $\boldsymbol{\mu}_x = \mathbf{0}$, then $\boldsymbol{\Gamma}_x = \mathbf{R}_x$. The autocovariance measures a weaker form of interaction
between random variables called correlatedness that should be contrasted with the stronger
form of independence that we described in (3.2.7). If random variables $x_i(\zeta)$ and $x_j(\zeta)$ are
independent, then they are also uncorrelated since (3.2.7) implies that

$$E\{x_i(\zeta)x_j^*(\zeta)\} = E\{x_i(\zeta)\}E\{x_j^*(\zeta)\} \qquad \text{or} \qquad \gamma_{ij} = 0 \tag{3.2.26}$$

but uncorrelatedness does not imply independence unless random variables are jointly
Gaussian (see Problem 3.15). The autocorrelation also measures another weaker form of
interaction called orthogonality. Random variables $x_i(\zeta)$ and $x_j(\zeta)$ are orthogonal if their
correlation

$$r_{ij} = E\{x_i(\zeta)x_j^*(\zeta)\} = 0 \qquad i \ne j \tag{3.2.27}$$

Clearly, from (3.2.26) if one or both random variables have zero means, then
uncorrelatedness also implies orthogonality.
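A standard illustration of "uncorrelated but not independent" takes $x$ uniform on $[-1, 1]$ and $y = x^2$. The sketch below (Python standard library; the seed, the sample size, and the test thresholds 0.5 and 0.25 are assumptions chosen for this example) estimates the covariance and compares a joint probability with the product of the marginal probabilities, in the spirit of (3.2.7).

```python
import random

random.seed(3)   # arbitrary seed for repeatability
N = 100_000      # assumed sample size

x = [random.uniform(-1.0, 1.0) for _ in range(N)]
y = [v * v for v in x]          # y is a deterministic function of x

mx = sum(x) / N
my = sum(y) / N
cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / N   # estimate of gamma_xy

# Check the factorization of the joint cdf at one point (x <= 0.5, y <= 0.25).
p_x = sum(1 for a in x if a <= 0.5) / N                              # theory: 0.75
p_y = sum(1 for b in y if b <= 0.25) / N                             # theory: 0.50
p_joint = sum(1 for a, b in zip(x, y) if a <= 0.5 and b <= 0.25) / N  # theory: 0.50

print(f"covariance estimate: {cov_xy:+.4f}")
print(f"joint: {p_joint:.3f}  product of marginals: {p_x * p_y:.3f}")
```

The covariance estimate is close to zero, yet the joint probability (about 0.5) differs markedly from the product of the marginals (about 0.375), so the two variables are uncorrelated without being independent.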


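We can also define correlation and covariance functions between two random vectors.
Let $\mathbf{x}(\zeta)$ and $\mathbf{y}(\zeta)$ be random M- and L-vectors, respectively. Then the $M \times L$ matrix

$$\mathbf{R}_{xy} \triangleq E\{\mathbf{x}\mathbf{y}^H\} = \begin{bmatrix} E\{x_1(\zeta)y_1^*(\zeta)\} & \cdots & E\{x_1(\zeta)y_L^*(\zeta)\} \\ \vdots & \ddots & \vdots \\ E\{x_M(\zeta)y_1^*(\zeta)\} & \cdots & E\{x_M(\zeta)y_L^*(\zeta)\} \end{bmatrix} \tag{3.2.28}$$

is called a cross-correlation matrix whose elements $r_{ij}$ are the correlations between random
variables $x_i(\zeta)$ and $y_j(\zeta)$. Similarly the $M \times L$ matrix

$$\boldsymbol{\Gamma}_{xy} \triangleq E\{[\mathbf{x}(\zeta) - \boldsymbol{\mu}_x][\mathbf{y}(\zeta) - \boldsymbol{\mu}_y]^H\} = \mathbf{R}_{xy} - \boldsymbol{\mu}_x\boldsymbol{\mu}_y^H \tag{3.2.29}$$

is called a cross-covariance matrix whose elements $c_{ij}$ are the covariances between $x_i(\zeta)$
and $y_j(\zeta)$. In general the cross-matrices are not square matrices, and even if $M = L$, they
are not necessarily symmetric. Two random vectors $\mathbf{x}(\zeta)$ and $\mathbf{y}(\zeta)$ are said to be

• Uncorrelated if
$$\boldsymbol{\Gamma}_{xy} = \mathbf{0} \;\Rightarrow\; \mathbf{R}_{xy} = \boldsymbol{\mu}_x\boldsymbol{\mu}_y^H \tag{3.2.30}$$
• Orthogonal if
$$\mathbf{R}_{xy} = \mathbf{0} \tag{3.2.31}$$

Again, if $\boldsymbol{\mu}_x$ or $\boldsymbol{\mu}_y$ or both are zero vectors, then (3.2.30) implies (3.2.31).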


3.2.2 Linear Transformations of Random Vectors 


Many signal processing applications involve linear operations on random vectors. Linear 
transformations are relatively simple mappings and are given by the matrix operation 


$$\mathbf{y}(\zeta) = g[\mathbf{x}(\zeta)] = \mathbf{A}\mathbf{x}(\zeta) \tag{3.2.32}$$

where $\mathbf{A}$ is an $L \times M$ (not necessarily square) matrix. The random vector $\mathbf{y}(\zeta)$ is completely
described by the density function $f_{\mathbf{y}}(\mathbf{y})$. If $L > M$, then only $M$ of the random variables $y_i(\zeta)$ can
be independently determined from $\mathbf{x}(\zeta)$. The remaining $(L - M)$ random variables $y_i(\zeta)$
can be obtained from the first $M$ random variables $y_i(\zeta)$. Thus we need to determine $f_{\mathbf{y}}(\mathbf{y})$
for $M$ random variables from which we can determine $f_{\mathbf{y}}(\mathbf{y})$ for all $L$ random variables. If
$M > L$, then we can augment $\mathbf{y}$ into an M-vector by introducing auxiliary random variables

$$y_{L+1}(\zeta) = x_{L+1}(\zeta), \; \ldots, \; y_M(\zeta) = x_M(\zeta) \tag{3.2.33}$$

to determine $f_{\mathbf{y}}(\mathbf{y})$ for $M$ random variables from which we can determine $f_{\mathbf{y}}(\mathbf{y})$ for the
original $L$ random variables. Therefore, for the determination of the pdf $f_{\mathbf{y}}(\mathbf{y})$, we will
assume that $L = M$ and that $\mathbf{A}$ is nonsingular.

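Furthermore, we will first consider the case in which both $\mathbf{x}(\zeta)$ and $\mathbf{y}(\zeta)$ are
real-valued random vectors, which also implies that $\mathbf{A}$ is a real-valued matrix. This approach is
necessary because the complex case leads to a slightly different result. Then the pdf $f_{\mathbf{y}}(\mathbf{y})$
is given by

$$f_{\mathbf{y}}(\mathbf{y}) = \frac{f_{\mathbf{x}}(g^{-1}(\mathbf{y}))}{|J|} \tag{3.2.34}$$

where $J$ is called the Jacobian of the transformation (3.2.32), given by

$$J = \det \begin{bmatrix} \dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_M}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_1}{\partial x_M} & \cdots & \dfrac{\partial y_M}{\partial x_M} \end{bmatrix} = \det \mathbf{A} \tag{3.2.35}$$

From (3.2.34) and (3.2.35), the pdf of $\mathbf{y}(\zeta)$ is given by

$$f_{\mathbf{y}}(\mathbf{y}) = \frac{f_{\mathbf{x}}(\mathbf{A}^{-1}\mathbf{y})}{|\det \mathbf{A}|} \qquad \text{real-valued random vector} \tag{3.2.36}$$

from which moment computations of any order of $\mathbf{y}(\zeta)$ can be performed. Now we consider
the case of the complex-valued random vectors. Then by applying the above approach to
both real and imaginary parts, the result (3.2.36) becomes

$$f_{\mathbf{y}}(\mathbf{y}) = \frac{f_{\mathbf{x}}(\mathbf{A}^{-1}\mathbf{y})}{|\det \mathbf{A}|^2} \qquad \text{complex-valued random vector} \tag{3.2.37}$$

This shows that sometimes we can get different results depending upon whether we assume
real- or complex-valued random vectors in our analysis.

Determining $f_{\mathbf{y}}(\mathbf{y})$ is, in general, tedious except in the case of Gaussian random vectors,
as we shall see later. In practice, the knowledge of $\boldsymbol{\mu}_y$, $\boldsymbol{\Gamma}_y$, $\boldsymbol{\Gamma}_{xy}$, or $\boldsymbol{\Gamma}_{yx}$ is sufficient in many
applications. If we take the expectation of both sides of (3.2.32), we find that the mean
vector is given by

$$\boldsymbol{\mu}_y = E\{\mathbf{y}(\zeta)\} = E\{\mathbf{A}\mathbf{x}(\zeta)\} = \mathbf{A}E\{\mathbf{x}(\zeta)\} = \mathbf{A}\boldsymbol{\mu}_x \tag{3.2.38}$$

The autocorrelation matrix of $\mathbf{y}(\zeta)$ is given by

$$\mathbf{R}_y = E\{\mathbf{y}\mathbf{y}^H\} = E\{\mathbf{A}\mathbf{x}\mathbf{x}^H\mathbf{A}^H\} = \mathbf{A}E\{\mathbf{x}\mathbf{x}^H\}\mathbf{A}^H = \mathbf{A}\mathbf{R}_x\mathbf{A}^H \tag{3.2.39}$$

Similarly, the autocovariance matrix of $\mathbf{y}(\zeta)$ is given by

$$\boldsymbol{\Gamma}_y = \mathbf{A}\boldsymbol{\Gamma}_x\mathbf{A}^H \tag{3.2.40}$$

Consider the cross-correlation matrix

$$\mathbf{R}_{xy} = E\{\mathbf{x}(\zeta)\mathbf{y}^H(\zeta)\} = E\{\mathbf{x}(\zeta)\mathbf{x}^H(\zeta)\mathbf{A}^H\} \tag{3.2.41}$$
$$\phantom{\mathbf{R}_{xy}} = E\{\mathbf{x}(\zeta)\mathbf{x}^H(\zeta)\}\mathbf{A}^H = \mathbf{R}_x\mathbf{A}^H \tag{3.2.42}$$

and hence $\mathbf{R}_{yx} = \mathbf{A}\mathbf{R}_x$. Similarly, the cross-covariance matrices are

$$\boldsymbol{\Gamma}_{xy} = \boldsymbol{\Gamma}_x\mathbf{A}^H \qquad \text{and} \qquad \boldsymbol{\Gamma}_{yx} = \mathbf{A}\boldsymbol{\Gamma}_x \tag{3.2.43}$$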
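The moment relations (3.2.38), (3.2.40), and (3.2.43) are easy to evaluate directly. The following sketch does so in plain Python for a small real-valued case, so that $\mathbf{A}^H = \mathbf{A}^T$. The matrix `A`, the mean `mu_x`, and the covariance `Gamma_x` are hypothetical values chosen only for illustration, and `matmul` and `transpose` are helper names introduced here.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

# Hypothetical example with L = 3 and M = 2 (real-valued, so A^H = A^T).
A = [[1.0, 2.0],
     [0.0, 1.0],
     [1.0, -1.0]]
mu_x = [[1.0], [-1.0]]          # mean of x as a column vector
Gamma_x = [[2.0, 0.5],
           [0.5, 1.0]]          # covariance of x

mu_y = matmul(A, mu_x)                                # (3.2.38)
Gamma_y = matmul(matmul(A, Gamma_x), transpose(A))    # (3.2.40)
Gamma_xy = matmul(Gamma_x, transpose(A))              # (3.2.43), M x L
Gamma_yx = matmul(A, Gamma_x)                         # (3.2.43), L x M

print("mu_y     =", mu_y)
print("Gamma_y  =", Gamma_y)
print("Gamma_xy =", Gamma_xy)
```

Since $L > M$ here, `Gamma_y` is a $3 \times 3$ matrix of rank at most 2, which reflects the earlier remark that only $M$ of the $y_i(\zeta)$ can be determined independently.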


3.2.3 Normal Random Vectors 


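If the components of the random vector $\mathbf{x}(\zeta)$ are jointly normal, then $\mathbf{x}(\zeta)$ is a normal
random M-vector. Again, the pdf expressions for the real- and complex-valued cases are
slightly different, and hence we consider these cases separately. The real-valued normal
random vector has the pdf

$$f_{\mathbf{x}}(\mathbf{x}) = \frac{1}{(2\pi)^{M/2}|\boldsymbol{\Gamma}_x|^{1/2}} \exp\left[-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_x)^T\boldsymbol{\Gamma}_x^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)\right] \qquad \text{real} \tag{3.2.44}$$

with mean $\boldsymbol{\mu}_x$ and covariance $\boldsymbol{\Gamma}_x$. It will be denoted by $\mathcal{N}(\boldsymbol{\mu}_x, \boldsymbol{\Gamma}_x)$. The term in the exponent
$(\mathbf{x} - \boldsymbol{\mu}_x)^T\boldsymbol{\Gamma}_x^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)$ is a positive definite quadratic function of $x_i$ and is also given by

$$(\mathbf{x} - \boldsymbol{\mu}_x)^T\boldsymbol{\Gamma}_x^{-1}(\mathbf{x} - \boldsymbol{\mu}_x) = \sum_{i=1}^{M}\sum_{j=1}^{M} \langle\boldsymbol{\Gamma}_x^{-1}\rangle_{ij}(x_i - \mu_i)(x_j - \mu_j) \tag{3.2.45}$$

where $\langle\boldsymbol{\Gamma}_x^{-1}\rangle_{ij}$ denotes the $(i, j)$th element of $\boldsymbol{\Gamma}_x^{-1}$. The characteristic function of the normal
random vector is given by

$$\Phi_{\mathbf{x}}(\boldsymbol{\xi}) = \exp\left(j\boldsymbol{\xi}^T\boldsymbol{\mu}_x - \tfrac{1}{2}\boldsymbol{\xi}^T\boldsymbol{\Gamma}_x\boldsymbol{\xi}\right) \tag{3.2.46}$$

where $\boldsymbol{\xi}^T = [\xi_1, \ldots, \xi_M]$.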


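The complex-valued normal random vector has the pdf

$$f_{\mathbf{x}}(\mathbf{x}) = \frac{1}{\pi^M|\boldsymbol{\Gamma}_x|} \exp\left[-(\mathbf{x} - \boldsymbol{\mu}_x)^H\boldsymbol{\Gamma}_x^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)\right] \qquad \text{complex} \tag{3.2.47}$$

with mean $\boldsymbol{\mu}_x$ and covariance $\boldsymbol{\Gamma}_x$. This pdf will be denoted by $\mathcal{CN}(\boldsymbol{\mu}_x, \boldsymbol{\Gamma}_x)$. If $\mathbf{x}(\zeta)$ is a
scalar complex-valued random variable $x(\zeta)$ with mean $\mu_x$ and variance $\sigma_x^2$, then (3.2.47)
reduces to

$$f_x(x) = \frac{1}{\pi\sigma_x^2} \exp\left(-\frac{|x - \mu_x|^2}{\sigma_x^2}\right) \tag{3.2.48}$$

which should be compared with the pdf given in (3.1.37). Note that the pdf in (3.1.37)
is not obtained by setting the imaginary part of $x(\zeta)$ in (3.2.48) equal to zero. For a
more detailed discussion on this aspect, see Therrien (1992) or Kay (1993). The term
$(\mathbf{x} - \boldsymbol{\mu}_x)^H\boldsymbol{\Gamma}_x^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)$ in the exponent of (3.2.47) is also a positive definite quadratic
function and is given by

$$(\mathbf{x} - \boldsymbol{\mu}_x)^H\boldsymbol{\Gamma}_x^{-1}(\mathbf{x} - \boldsymbol{\mu}_x) = \sum_{i=1}^{M}\sum_{j=1}^{M} \langle\boldsymbol{\Gamma}_x^{-1}\rangle_{ij}(x_i - \mu_i)^*(x_j - \mu_j) \tag{3.2.49}$$

The characteristic function for the complex-valued normal random vector is given by

$$\Phi_{\mathbf{x}}(\boldsymbol{\xi}) = \exp\left[j\,\mathrm{Re}(\boldsymbol{\xi}^H\boldsymbol{\mu}_x) - \tfrac{1}{4}\boldsymbol{\xi}^H\boldsymbol{\Gamma}_x\boldsymbol{\xi}\right] \tag{3.2.50}$$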


The normal distribution is a useful model of a random vector because of its many 
important properties: 


1. The pdf is completely specified by the mean vector and the covariance matrix, which are 
relatively easy to estimate in practice. All other higher-order moments can be obtained 
from these parameters. 

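2. If the components of $\mathbf{x}(\zeta)$ are mutually uncorrelated, then they are also independent.
(See Problem 3.15.) This is useful in many derivations.

3. A linear transformation of a normal random vector is also normal. This can be easily
seen by using (3.2.38), (3.2.40), and (3.2.44) in (3.2.36); that is, for the real-valued case
we obtain

$$f_{\mathbf{y}}(\mathbf{y}) = \frac{1}{(2\pi)^{M/2}|\boldsymbol{\Gamma}_y|^{1/2}} \exp\left[-\tfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu}_y)^T\boldsymbol{\Gamma}_y^{-1}(\mathbf{y} - \boldsymbol{\mu}_y)\right] \qquad \text{real} \tag{3.2.51}$$

This result can also be proved by using the characteristic function in (3.2.46) (see
Problem 3.6). Similarly for the complex-valued case, from (3.2.37) and (3.2.47) we
obtain

$$f_{\mathbf{y}}(\mathbf{y}) = \frac{1}{\pi^M|\boldsymbol{\Gamma}_y|} \exp\left[-(\mathbf{y} - \boldsymbol{\mu}_y)^H(\mathbf{A}^{-1})^H\boldsymbol{\Gamma}_x^{-1}\mathbf{A}^{-1}(\mathbf{y} - \boldsymbol{\mu}_y)\right] \qquad \text{complex} \tag{3.2.52}$$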
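4. The fourth-order moment of a normal random vector

$$\mathbf{x}(\zeta) = [x_1(\zeta)\;\; x_2(\zeta)\;\; x_3(\zeta)\;\; x_4(\zeta)]^T$$

can be expressed in terms of its second-order moments. For the real case, that is, when
$\mathbf{x}(\zeta) \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Gamma}_x)$, we have

$$\begin{aligned} E\{x_1(\zeta)x_2(\zeta)x_3(\zeta)x_4(\zeta)\} = {} & E\{x_1(\zeta)x_2(\zeta)\}E\{x_3(\zeta)x_4(\zeta)\} \\ & + E\{x_1(\zeta)x_3(\zeta)\}E\{x_2(\zeta)x_4(\zeta)\} \\ & + E\{x_1(\zeta)x_4(\zeta)\}E\{x_2(\zeta)x_3(\zeta)\} \end{aligned} \tag{3.2.53}$$

For the complex case, that is, when $\mathbf{x}(\zeta) \sim \mathcal{CN}(\mathbf{0}, \boldsymbol{\Gamma}_x)$, we have

$$\begin{aligned} E\{x_1^*(\zeta)x_2(\zeta)x_3^*(\zeta)x_4(\zeta)\} = {} & E\{x_1^*(\zeta)x_2(\zeta)\}E\{x_3^*(\zeta)x_4(\zeta)\} \\ & + E\{x_1^*(\zeta)x_4(\zeta)\}E\{x_2(\zeta)x_3^*(\zeta)\} \end{aligned} \tag{3.2.54}$$

The proof of (3.2.53) is tedious but straightforward. However, the proof of (3.2.54) is
complicated and is discussed in Kay (1993).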
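The real-valued factorization (3.2.53) can be checked by simulation. The sketch below is a Monte Carlo illustration, not a proof: it builds four zero-mean, jointly Gaussian, correlated variables from three independent $\mathcal{N}(0, 1)$ variables and compares both sides of (3.2.53) using sample averages. The particular mixing of `g1`, `g2`, `g3`, the seed, and the number of trials are assumptions of this example.

```python
import math
import random

random.seed(11)   # arbitrary seed for repeatability
N = 200_000       # assumed number of trials

x1, x2, x3, x4 = [], [], [], []
for _ in range(N):
    g1, g2, g3 = (random.gauss(0.0, 1.0) for _ in range(3))
    # Zero-mean, jointly Gaussian, correlated components (hypothetical mixing).
    x1.append(g1)
    x2.append(g1 + g2)
    x3.append(g2 + g3)
    x4.append(g1 + g3)

def moment(*seqs):
    """Sample average of the elementwise product of the given sequences."""
    return sum(math.prod(vals) for vals in zip(*seqs)) / N

lhs = moment(x1, x2, x3, x4)
rhs = (moment(x1, x2) * moment(x3, x4)
       + moment(x1, x3) * moment(x2, x4)
       + moment(x1, x4) * moment(x2, x3))

# For this mixing the exact second moments give 1*1 + 0*1 + 1*1 = 2 on the right side.
print(f"fourth-order moment: {lhs:.3f}   sum of pair products: {rhs:.3f}")
```

The fourth-order estimate fluctuates more than the second-order ones, so agreement is only up to Monte Carlo error.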


3.2.4 Sums of Independent Random Variables 


In many applications, a random variable $y(\zeta)$ can be expressed as a linear combination of
$M$ statistically independent random variables $\{x_k(\zeta)\}_{k=1}^{M}$, that is,

$$y(\zeta) = c_1x_1(\zeta) + c_2x_2(\zeta) + \cdots + c_Mx_M(\zeta) = \sum_{k=1}^{M} c_kx_k(\zeta) \tag{3.2.55}$$

where $\{c_k\}_{1}^{M}$ is a set of fixed coefficients. In these situations, we would like to compute
the first two moments and the pdf of y(¢). The moment computation is straightforward, 
but the pdf computation requires the use of characteristic functions. When these results are 
extended to the sum of an infinite number of statistically independent random variables, 
we obtain a powerful theorem called the central limit theorem (CLT). Another interesting 
concept develops when the sum of IID random variables preserves their distribution, which 
results in stable distributions. 


Mean. Using the linearity of the expectation operator and taking the expectation of
both sides of (3.2.55), we obtain

$$\mu_y = \sum_{k=1}^{M} c_k\mu_{x_k} \tag{3.2.56}$$

Variance. The variance of $y(\zeta)$ is given by

$$\sigma_y^2 = E\left\{\left|\sum_{k=1}^{M} c_k[x_k(\zeta) - \mu_{x_k}]\right|^2\right\} = \sum_{k=1}^{M} |c_k|^2\sigma_{x_k}^2 \tag{3.2.57}$$

where the cross terms vanish because of the statistical independence between the random variables.
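As a quick sanity check of (3.2.56) and (3.2.57), the sketch below compares the formulas with sample estimates for a sum of three independent, differently distributed random variables. The coefficients in `c`, the choice of component distributions, the seed, and the sample size are assumptions made for this illustration (Python standard library).

```python
import random

random.seed(5)   # arbitrary seed for repeatability
N = 100_000      # assumed sample size
c = [2.0, -1.0, 0.5]   # hypothetical coefficients c_k

# Independent components with known means and variances:
#   x1 ~ uniform(0, 1):              mean 1/2, variance 1/12
#   x2 ~ N(1, 2^2):                  mean 1,   variance 4
#   x3 ~ exponential with rate 1:    mean 1,   variance 1
means = [0.5, 1.0, 1.0]
variances = [1.0 / 12.0, 4.0, 1.0]

mu_theory = sum(ck * m for ck, m in zip(c, means))             # (3.2.56)
var_theory = sum(ck ** 2 * v for ck, v in zip(c, variances))   # (3.2.57)

y = [c[0] * random.random()
     + c[1] * random.gauss(1.0, 2.0)
     + c[2] * random.expovariate(1.0)
     for _ in range(N)]
mu_hat = sum(y) / N
var_hat = sum((v - mu_hat) ** 2 for v in y) / N

print(f"mean: theory {mu_theory:.4f}, estimate {mu_hat:.4f}")
print(f"var : theory {var_theory:.4f}, estimate {var_hat:.4f}")
```

Only the means and variances of the components enter the two formulas; the shapes of the individual distributions matter for the pdf of the sum, not for its first two moments.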


Probability density function. Before we derive the pdf of $y(\zeta)$ in (3.2.55), we consider
two special cases. First, let

$$y(\zeta) = x_1(\zeta) + x_2(\zeta) \tag{3.2.58}$$

where $x_1(\zeta)$ and $x_2(\zeta)$ are statistically independent. Then its characteristic function is given
by

$$\Phi_y(\xi) = E\{e^{j\xi y(\zeta)}\} = E\{e^{j\xi[x_1(\zeta) + x_2(\zeta)]}\} = E\{e^{j\xi x_1(\zeta)}\}E\{e^{j\xi x_2(\zeta)}\} \tag{3.2.59}$$

where the last equality follows from the independence. Hence

$$\Phi_y(\xi) = \Phi_{x_1}(\xi)\Phi_{x_2}(\xi) \tag{3.2.60}$$

or from the convolution property of the Fourier transform

$$f_y(y) = f_{x_1}(y) * f_{x_2}(y) \tag{3.2.61}$$

From (3.2.60) the second characteristic function of $y(\zeta)$ is given by

$$\Psi_y(\xi) = \Psi_{x_1}(\xi) + \Psi_{x_2}(\xi) \tag{3.2.62}$$

or the mth-order cumulant of $y(\zeta)$ is given by

$$\kappa_y^{(m)} = \kappa_{x_1}^{(m)} + \kappa_{x_2}^{(m)} \tag{3.2.63}$$

These results can be easily generalized to the sum of $M$ independent random variables.


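EXAMPLE 3.2.1. Let $\{x_k(\zeta)\}_{k=1}^{4}$ be four IID random variables uniformly distributed over
$[-0.5, 0.5]$. Compute and plot the pdfs of $y_M(\zeta) \triangleq \sum_{k=1}^{M} x_k(\zeta)$ for $M = 2$, 3, and 4. Compare
these pdfs with that of a zero-mean Gaussian random variable.

Solution. Let $f(x)$ be the pdf of a uniform random variable over $[-0.5, 0.5]$, that is,

$$f(x) = \begin{cases} 1 & -0.5 \le x \le 0.5 \\ 0 & \text{otherwise} \end{cases} \tag{3.2.64}$$

Then from (3.2.61)

$$f_{y_2}(y) = f(y) * f(y) = \begin{cases} 1 + y & -1 \le y \le 0 \\ 1 - y & 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases} \tag{3.2.65}$$

Similarly, we have

$$f_{y_3}(y) = f_{y_2}(y) * f(y) = \begin{cases} \tfrac{1}{2}\left(y + \tfrac{3}{2}\right)^2 & -\tfrac{3}{2} \le y \le -\tfrac{1}{2} \\ \tfrac{3}{4} - y^2 & -\tfrac{1}{2} \le y \le \tfrac{1}{2} \\ \tfrac{1}{2}\left(y - \tfrac{3}{2}\right)^2 & \tfrac{1}{2} \le y \le \tfrac{3}{2} \\ 0 & \text{otherwise} \end{cases} \tag{3.2.66}$$

and

$$f_{y_4}(y) = f_{y_3}(y) * f(y) = \begin{cases} \tfrac{1}{6}(y + 2)^3 & -2 \le y \le -1 \\ -\tfrac{1}{2}y^3 - y^2 + \tfrac{2}{3} & -1 \le y \le 0 \\ \tfrac{1}{2}y^3 - y^2 + \tfrac{2}{3} & 0 \le y \le 1 \\ -\tfrac{1}{6}(y - 2)^3 & 1 \le y \le 2 \\ 0 & \text{otherwise} \end{cases} \tag{3.2.67}$$

The plots of $f_{y_2}(y)$, $f_{y_3}(y)$, and $f_{y_4}(y)$ are shown in Figure 3.4 along with the zero-mean
Gaussian pdf. The variance of the Gaussian random variable is chosen so that 99.92 percent of
the pdf area is over $[-2, 2]$. We observe that as $M$ increases, the pdf plots appear to get closer
to the shape of the Gaussian pdf. This observation will be explored in detail in the CLT.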
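The closed forms in Example 3.2.1 can be cross-checked by carrying out the convolution (3.2.61) numerically. The following sketch approximates the convolution integral by a Riemann sum on a uniform grid and compares the result with the triangular pdf (3.2.65) and with the peak value $3/4$ of (3.2.66). The grid step `h`, the half weights at the interval edges, and the helper names `conv` and `triangle` are choices made for this illustration.

```python
def conv(f, g, h):
    """Riemann-sum approximation of the convolution of two sampled pdfs."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for k, gk in enumerate(g):
            out[i + k] += fi * gk * h
    return out

h = 0.01                 # assumed grid step
f = [1.0] * 101          # uniform pdf sampled on [-0.5, 0.5]
f[0] = f[-1] = 0.5       # half weight at the edges (trapezoidal rule)

f2 = conv(f, f, h)       # samples of the pdf of y_2 on [-1, 1]
f3 = conv(f2, f, h)      # samples of the pdf of y_3 on [-1.5, 1.5]

def triangle(y):
    """Closed form (3.2.65)."""
    return max(0.0, 1.0 - abs(y))

max_err = max(abs(f2[i] - triangle(-1.0 + i * h)) for i in range(len(f2)))
area = sum(f2) * h
peak3 = f3[len(f3) // 2]   # value at y = 0; (3.2.66) gives 3/4

print(f"max error vs. triangle: {max_err:.4f}, area: {area:.6f}, f_y3(0): {peak3:.4f}")
```

The discretization error is of the order of the grid step and is concentrated at the points where the pdf has a corner or a jump.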


Next, let $y(\zeta) = a x(\zeta) + b$; then the characteristic function of $y(\zeta)$ is

$$\Phi_y(\xi) = E\{e^{j[ax(\zeta) + b]\xi}\} = E\{e^{ja\xi x(\zeta)}e^{jb\xi}\} = \Phi_x(a\xi)e^{jb\xi} \tag{3.2.68}$$

and by using the properties of the Fourier transform, the pdf of $y(\zeta)$ is given by

$$f_y(y) = \frac{1}{|a|} f_x\left(\frac{y - b}{a}\right) \tag{3.2.69}$$

From (3.2.68), the second characteristic function is given by

$$\Psi_y(\xi) = \Psi_x(a\xi) + jb\xi \tag{3.2.70}$$

and the cumulants are given by

$$\kappa_y^{(m)} = (-j)^m \left.\frac{d^m\Psi_y(\xi)}{d\xi^m}\right|_{\xi=0} = a^m(-j)^m \left.\frac{d^m\Psi_x(\xi)}{d\xi^m}\right|_{\xi=0} = a^m\kappa_x^{(m)} \qquad m > 1 \tag{3.2.71}$$

FIGURE 3.4
The pdf plots of (a) sum of two, (b) sum of three, (c) sum of four, and (d) Gaussian random
variables in Example 3.2.1. [Panel titles: (c) $M = 4$; (d) $\mathcal{N}(0, 0.6)$.]


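Finally, consider $y(\zeta)$ in (3.2.55). Using the results in (3.2.60) and (3.2.68), we have

$$\Phi_y(\xi) = \prod_{k=1}^{M} \Phi_{x_k}(c_k\xi) \tag{3.2.72}$$

from which the pdf of $y(\zeta)$ is given by

$$f_y(y) = \frac{1}{|c_1|} f_{x_1}\left(\frac{y}{c_1}\right) * \frac{1}{|c_2|} f_{x_2}\left(\frac{y}{c_2}\right) * \cdots * \frac{1}{|c_M|} f_{x_M}\left(\frac{y}{c_M}\right) \tag{3.2.73}$$

From (3.2.62) and (3.2.70), the second characteristic function is given by

$$\Psi_y(\xi) = \sum_{k=1}^{M} \Psi_{x_k}(c_k\xi) \tag{3.2.74}$$

and hence from (3.2.63) and (3.2.71), the cumulants of $y(\zeta)$ are

$$\kappa_y^{(m)} = \sum_{k=1}^{M} c_k^m\kappa_{x_k}^{(m)} \tag{3.2.75}$$

where $c_k^m$ is the mth power of $c_k$.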


In the following two examples, we consider two interesting cases in which the sum of
IID random variables retains their original distribution. The first case concerns Gaussian
random variables that have finite variances while the second case involves Cauchy random
variables that possess infinite variance.


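EXAMPLE 3.2.2. Let $x_k(\zeta) \sim \mathcal{N}(\mu_k, \sigma_k^2)$, $k = 1, \ldots, M$ and let $y(\zeta) = \sum_{k=1}^{M} x_k(\zeta)$. The
characteristic function of $x_k(\zeta)$ is

$$\Phi_{x_k}(\xi) = \exp\left(j\mu_k\xi - \frac{\xi^2\sigma_k^2}{2}\right)$$

and hence from (3.2.72), we have

$$\Phi_y(\xi) = \exp\left(j\xi\sum_{k=1}^{M}\mu_k - \frac{\xi^2}{2}\sum_{k=1}^{M}\sigma_k^2\right)$$

which means that $y(\zeta)$ is also a Gaussian random variable with mean $\sum_{k=1}^{M}\mu_k$ and variance
$\sum_{k=1}^{M}\sigma_k^2$, that is, $y(\zeta) \sim \mathcal{N}\left(\sum_{k=1}^{M}\mu_k, \sum_{k=1}^{M}\sigma_k^2\right)$. In particular, if the $x_k(\zeta)$ are IID with a pdf
$\mathcal{N}(\mu, \sigma^2)$, then

$$\Phi_y(\xi) = \exp\left(jM\mu\xi - \frac{\xi^2M\sigma^2}{2}\right) = \exp\left[M\left(j\mu\xi - \frac{\xi^2\sigma^2}{2}\right)\right] \tag{3.2.76}$$

This behavior of $y(\zeta)$ is in contrast with that of the sum of the IID random variables in
Example 3.2.1 in which the uniform pdf changed its form after M-fold convolutions.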


EXAMPLE 3.2.3. As a second case, consider $M$ IID random variables $\{x_k(\zeta)\}_{k=1}^{M}$ with Cauchy
distribution

$$f_x(x) = \frac{\beta}{\pi}\,\frac{1}{(x - \alpha)^2 + \beta^2}$$

and let $y(\zeta) = \sum_{k=1}^{M} x_k(\zeta)$. Then from (3.1.43), we have

$$\Phi_x(\xi) = \exp(j\alpha\xi - \beta|\xi|)$$

and hence

$$\Phi_y(\xi) = \exp(jM\alpha\xi - M\beta|\xi|) = \exp[M(j\alpha\xi - \beta|\xi|)] \tag{3.2.77}$$

This once again shows that the sum random variable has the same distribution (up to a scale
factor) as that of the individual random variables, which in this case is the Cauchy distribution.


From these examples, we note that the Gaussian and the Cauchy random variables 
are invariant, or that they have a “self-reproducing” property under linear transformations. 
These two examples also raise some interesting questions. Are there any other random vari- 
ables that possess this invariance property? If such random variables exist, what is the form 
of their pdfs or, alternatively, of their characteristic functions, and what can we say about 
their means and variances? From (3.2.76) and (3.2.77), observe that if the characteristic
function has a general form

$$\Phi_x(\xi) = a^{\theta(\xi)} \tag{3.2.78}$$

where $a$ is some constant and $\theta(\xi)$ is some function of $\xi$, then we have

$$\Phi_y(\xi) = a^{M\theta(\xi)} \tag{3.2.79}$$


that is, the characteristic function of the sum has the same functional form except for a 
change in scale. Are Gaussian and Cauchy both special cases of some general situation? 
These questions are answered by the concept of stable (more appropriately, linearly invariant 
or self-reproducing) distributions. 


Stable distributions. These distributions satisfy the “stability” property, which in sim- 
ple terms means that the distributions are preserved (or that they self-reproduce) under 
convolution. The only stable distribution that has finite variance is the Gaussian distri- 
bution, which has been well understood and is used extensively in the literature and in 


practice. The remaining stable distributions have infinite variances (and in some cases, 
infinite means) which means that the corresponding random variables exhibit large fluctua- 
tions. These distributions can then be used to model signals with large variability and hence 
are finding increasing use in many diverse applications such as the gravitational fields of 
stars, temperature distributions in a nuclear reaction, or stock market fluctuations (Lamperti 
1996; Samorodnitsky and Taqqu 1994; Feller 1966). 

Before we formally define stable distributions, we introduce the following notation for
convenience

$$y(\zeta) \overset{d}{=} x(\zeta) \tag{3.2.80}$$

to indicate that the random variables $x(\zeta)$ and $y(\zeta)$ have the same distribution. For example,
if $y(\zeta) = a x(\zeta) + b$ with $a > 0$, we have

$$F_y(y) = F_x\left(\frac{y - b}{a}\right) \tag{3.2.81}$$

and therefore $y(\zeta) \overset{d}{=} a x(\zeta) + b$.


DEFINITION 3.2. Let $x_1(\zeta), x_2(\zeta), \ldots, x_M(\zeta)$ be IID random variables with a common
distribution $F_x(x)$ and let $s_M(\zeta) = x_1(\zeta) + \cdots + x_M(\zeta)$ be their sum. The distribution $F_x(x)$ is said
to be stable if for each $M$ there exist constants $a_M > 0$ and $b_M$ such that

$$s_M(\zeta) \overset{d}{=} a_M x(\zeta) + b_M \tag{3.2.82}$$

and that $F_x(x)$ is not concentrated at one point.

If (3.2.82) holds for $b_M = 0$, we say that $F_x(x)$ is stable in the strict sense. The condition
that $F_x(x)$ is not concentrated at one point is necessary because such a distribution is always
stable. Thus it is a degenerate case that is of no practical interest. A stable distribution is
called symmetric stable if the distribution is symmetric, which also implies that it is strictly
stable.

It can be shown that for any stable random variable $x(\zeta)$ there is a number $\alpha$, $0 < \alpha \le 2$,
such that the constant $a_M$ in (3.2.82) is $a_M = M^{1/\alpha}$. The number $\alpha$ is known as the index
of stability or characteristic exponent. A stable random variable $x(\zeta)$ with index $\alpha$ is called
$\alpha$ stable.

Since there is no closed-form expression for the probability density function of stable
random variables, except in special cases, they are specified by their characteristic function
$\Phi(\xi)$. This characteristic function is given by

$$\Phi(\xi) = \begin{cases} \exp\left\{j\mu\xi - |\sigma\xi|^{\alpha}\left[1 - j\beta\,\mathrm{sign}(\xi)\tan\left(\dfrac{\pi\alpha}{2}\right)\right]\right\} & \alpha \ne 1 \\[2ex] \exp\left\{j\mu\xi - |\sigma\xi|^{\alpha}\left[1 - j\beta\left(\dfrac{2}{\pi}\right)\mathrm{sign}(\xi)\ln|\xi|\right]\right\} & \alpha = 1 \end{cases} \tag{3.2.83}$$

where $\mathrm{sign}(\xi) = \xi/|\xi|$ if $\xi \ne 0$ and zero otherwise. We shall use the notation $S_\alpha(\sigma, \beta, \mu)$
to denote the stable random variable defined by (3.2.83). The parameters in (3.2.83) have
the following meaning:

1. The characteristic exponent $\alpha$, $0 < \alpha \le 2$, determines the shape of the distribution and
hence the flatness of the tails.

2. The skewness (or alternatively, symmetry) parameter $\beta$, $-1 \le \beta \le 1$, determines the
symmetry of the distribution: $\beta = 0$ specifies a symmetric distribution, $\beta < 0$ a
left-skewed distribution, and $\beta > 0$ a right-skewed distribution.

3. The scale parameter $\sigma$, $0 < \sigma < \infty$, determines the range or dispersion of the stable
distribution.

4. The location parameter $\mu$, $-\infty < \mu < \infty$, determines the center of the distribution.

We next list some useful properties of stable random variables.

1. For $0 < \alpha < 2$, the tails of a stable distribution decay as a power law, that is,

$$\Pr[|x(\zeta) - \mu| \ge x] \simeq \frac{C}{x^{\alpha}} \qquad \text{as } x \to \infty \tag{3.2.84}$$

where $C$ is a constant that depends on the scale parameter $\sigma$. As a result of this behavior,
$\alpha$-stable random variables have infinite second-order moments. In particular,

$$\begin{aligned} E\{|x(\zeta)|^p\} &< \infty \qquad \text{for any } 0 < p < \alpha \\ E\{|x(\zeta)|^p\} &= \infty \qquad \text{for any } p \ge \alpha \end{aligned} \tag{3.2.85}$$

Also $\mathrm{var}[x(\zeta)] = \infty$ for $0 < \alpha < 2$, and $E\{|x(\zeta)|\} = \infty$ if $0 < \alpha \le 1$.

2. A stable distribution is symmetric about $\mu$ iff $\beta = 0$. A symmetric $\alpha$-stable distribution
is denoted as S$\alpha$S, and its characteristic function is given by

$$\Phi(\xi) = \exp(j\mu\xi - |\sigma\xi|^{\alpha}) \tag{3.2.86}$$

3. If $x(\zeta)$ is S$\alpha$S with $\alpha = 2$ in (3.2.83), we have a Gaussian distribution with variance
equal to $2\sigma^2$, that is, $\mathcal{N}(\mu, 2\sigma^2)$, whose tails decay exponentially and not as a power
law. Thus, the Gaussian is the only stable distribution with finite variance.

4. If $x(\zeta)$ is S$\alpha$S with $\alpha = 1$, we have a Cauchy distribution with density

$$f_x(x) = \frac{\sigma/\pi}{(x - \mu)^2 + \sigma^2} \tag{3.2.87}$$

A standard ($\mu = 0$, $\sigma = 1$) Cauchy random variable $x(\zeta)$ can be generated from a [0, 1]
uniform random variable $u(\zeta)$, by using the transformation $x = \tan[\pi(u - \tfrac{1}{2})]$.

5. If $x(\zeta)$ is S$\alpha$S with $\alpha = \tfrac{1}{2}$, we have a Levy distribution, which has both infinite variance
and infinite mean. The pdf of this distribution does not have a functional form and hence
must be computed numerically.
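The transformation $x = \tan[\pi(u - \tfrac{1}{2})]$ mentioned in property 4 can be tried out directly. The sketch below (Python standard library; the seed, the sample size, and the helper name `cauchy_cdf` are assumptions of this example) generates standard Cauchy samples from uniform pseudo random numbers and compares two empirical probabilities with the Cauchy cdf of the form (3.1.42).

```python
import math
import random

random.seed(2024)   # arbitrary seed for repeatability
N = 100_000         # assumed sample size

# x = tan[pi (u - 1/2)] with u uniform on [0, 1) gives a standard Cauchy sample.
x = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(N)]

def cauchy_cdf(v, mu=0.0, sigma=1.0):
    """Cauchy cdf: 0.5 + arctan((v - mu) / sigma) / pi."""
    return 0.5 + math.atan((v - mu) / sigma) / math.pi

frac_within_1 = sum(1 for v in x if abs(v) <= 1.0) / N   # theory: 0.5
ecdf_at_2 = sum(1 for v in x if v <= 2.0) / N            # theory: cauchy_cdf(2.0)

print(f"Pr(|x| <= 1) estimate: {frac_within_1:.3f} (theory 0.500)")
print(f"Pr(x <= 2)   estimate: {ecdf_at_2:.3f} (theory {cauchy_cdf(2.0):.3f})")
print(f"largest |x| in the sample: {max(abs(v) for v in x):.1f}")
```

Probabilities and quantiles are used for the comparison because sample means and variances of Cauchy data do not settle down as the sample size grows; the last printed line typically shows a very large outlier, a symptom of the heavy tails.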


In Figure 3.5, we display characteristic and density functions of Gaussian, Cauchy, and 
Levy random variables. The density plots were computed numerically using the MATLAB 
function stablepdf. 


Infinitely divisible distributions. A distribution F;,(x) is infinitely divisible if and only 
if for each M there exists a distribution Fyy(x) such that 


fx (x) = fu (x) * fux) * +++ * fu) (3.2.88) 
or by using the convolution theorem, 
1 (E) = Ou (E) Ou(E) «+» Ou E) = OYE) (3.2.89) 


that is, for each M the random variable $x(\zeta)$ can be represented as the sum $x(\zeta) = x_1(\zeta) +
\cdots + x_M(\zeta)$ of M IID random variables with a common distribution $F_M(x)$. Clearly, all
stable distributions are infinitely divisible. An example of an infinitely divisible pdf is shown
in Figure 3.6 for $M = 4$, $\alpha = 1.5$, $\mu = 0$, and $\beta = 0$.
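
For the S$\alpha$S family, relation (3.2.89) can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the characteristic function (3.2.86) and assumes that each of the M components is itself S$\alpha$S with location $\mu/M$ and scale $\sigma M^{-1/\alpha}$, a choice that follows from substituting into (3.2.86) and is not given in the text. The value $\sigma = 1$ and the grid of $\xi$ values are also arbitrary.

```python
import cmath

def sas_cf(xi, alpha, sigma=1.0, mu=0.0):
    """Characteristic function of a symmetric alpha-stable law, as in (3.2.86)."""
    return cmath.exp(1j * mu * xi - abs(sigma * xi) ** alpha)

alpha, sigma, mu, M = 1.5, 1.0, 0.0, 4
sigma_M = sigma * M ** (-1.0 / alpha)   # assumed component scale
mu_M = mu / M                           # assumed component location

max_err = 0.0
for i in range(-50, 51):
    xi = 0.1 * i
    whole = sas_cf(xi, alpha, sigma, mu)
    product = sas_cf(xi, alpha, sigma_M, mu_M) ** M   # Phi_M(xi)^M
    max_err = max(max_err, abs(whole - product))

print(max_err)   # agreement up to floating-point rounding
```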


Central limit theorem. Consider the random variable $y(\zeta)$ defined in (3.2.55). We
would like to know about the convergence of its distribution as $M \to \infty$. If $y(\zeta)$ is a sum
of IID random variables with a stable distribution, the distribution of $y(\zeta)$ also converges
to a stable distribution. What result should we expect if the individual distributions are not 
stable and, in particular, are of finite variance? As we observed in Example 3.2.1, the sum 
of uniformly distributed independent random variables appears to converge to a Gaussian 
distribution. Is this result valid for any other distribution? The following version of the CLT 
answers these questions. 


FIGURE 3.5
The characteristic and density function plots of Gaussian, Cauchy, and Levy random variables. (Gaussian panel: $\alpha = 2$, $\sigma = 0.7071$, $\beta = \mu = 0$.)


THEOREM 3.1 (CENTRAL LIMIT THEOREM). Let $\{x_k(\zeta)\}_{k=1}^{M}$ be a collection of random
variables such that $x_1(\zeta), x_2(\zeta), \ldots, x_M(\zeta)$ (a) are mutually independent and (b) have the same
distribution, and (c) the mean and variance of each random variable exist and are finite, that is,
$\mu_{x_k} < \infty$ and $\sigma_{x_k}^2 < \infty$ for all $k = 1, 2, \ldots, M$. Then, the distribution of the normalized sum

$$y_M(\zeta) = \frac{\sum_{k=1}^{M} x_k(\zeta) - \mu_{y_M}}{\sigma_{y_M}}$$

where $\mu_{y_M}$ and $\sigma_{y_M}$ are the mean and standard deviation of the sum, approaches that of a
normal random variable with zero mean and unit standard deviation as $M \to \infty$.


Proof. See Borkar (1995). 


Comments. The following important comments are in order regarding the CLT. 


1. Since we are assuming IID components in the normalized sum, the above theorem is 
known as the equal-component case of the CLT. 

2. It should be emphasized that the convergence in the above theorem is in distribution
(cdf) and not necessarily in density (pdf). Suppose we have M discrete and IID random
variables. Then their normalized sum will always remain discrete no matter how large
M is, but the distribution of the sum will converge to the integral of the Gaussian pdf.


FIGURE 3.6
The characteristic and density function plots of an infinitely divisible distribution. (Panels: infinitely divisible pdf with $\alpha = 1.5$, $\beta = 0$, and the component pdf for $M = 4$.)


3. The word central in the CLT is a reminder that the distribution converges to the Gaussian 
distribution around the center, that is, around the mean. Note that while the limit distri- 
bution is found to be Gaussian, frequently the Gaussian limit gives a poor approximation 
for the tails of the actual distribution function of the sum when M is finite, even though 
the actual value under consideration might seem to be quite large. 

4. As a final point, we note that in the above theorem the assumption of finite variance is
critical to obtain a Gaussian limit. This implies that all distributions with finite variance
will converge to the Gaussian when independent copies of their random variables are
added. What happens if the variance is infinite? In this case the sum converges
to one of the stable distributions. For example, as shown in Example 3.2.3, the sum of
Cauchy random variables converges to a Cauchy distribution.
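
The equal-component CLT can be illustrated with a small Monte Carlo experiment. The sketch below is an illustration only, with assumed values for the number of terms M, the number of trials, and the evaluation points. It forms the normalized sum of M uniform [0, 1] variables (mean $M/2$, variance $M/12$) and compares its empirical cdf with the Gaussian cdf, in keeping with comment 2, which states that the convergence is in distribution.

```python
import math
import random

def normalized_uniform_sum(M, rng):
    """Normalized sum of M IID uniform [0, 1) variables (zero mean, unit variance)."""
    s = sum(rng.random() for _ in range(M))
    return (s - M / 2.0) / math.sqrt(M / 12.0)

def gaussian_cdf(z):
    """cdf of a zero-mean, unit-variance Gaussian random variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = random.Random(1)       # fixed seed (assumption) for repeatability
M, trials = 30, 20_000       # illustrative sizes, not from the text
y = [normalized_uniform_sum(M, rng) for _ in range(trials)]

max_gap = 0.0
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    empirical = sum(1 for v in y if v <= z) / trials
    max_gap = max(max_gap, abs(empirical - gaussian_cdf(z)))

print(max_gap)   # largest cdf mismatch over the test points
```

The test points lie near the center of the distribution. Following comment 3, a similar comparison far out in the tails would be expected to show a larger relative error for finite M.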


3.3 DISCRETE-TIME STOCHASTIC PROCESSES 


After studying random variables and vectors, we can now extend these concepts to discrete- 
time signals (or sequences). Many natural sequences can be characterized as random signals 
because we cannot determine their values precisely, that is, they are unpredictable. A nat- 
ural mathematical framework for the description of these discrete-time random signals is 
provided by discrete-time stochastic processes. 

To obtain a formal definition, consider an experiment with a finite or infinite number
of unpredictable outcomes from a sample space $S = \{\zeta_1, \zeta_2, \ldots\}$, each occurring with
a probability $\Pr\{\zeta_k\}$, $k = 1, 2, \ldots$. By some rule we assign to each element $\zeta_k$ of $S$ a
deterministic sequence $x(n, \zeta_k)$, $-\infty < n < \infty$. The sample space $S$, the probabilities
$\Pr\{\zeta_k\}$, and the sequences $x(n, \zeta_k)$, $-\infty < n < \infty$, constitute a discrete-time stochastic
process or random sequence. Formally,


$x(n, \zeta)$, $-\infty < n < \infty$, is a random sequence if for a fixed value $n_0$ of $n$, $x(n_0, \zeta)$
is a random variable.


The set of all possible sequences $\{x(n, \zeta)\}$ is called an ensemble, and each individual
sequence $x(n, \zeta_k)$, corresponding to a specific value of $\zeta = \zeta_k$, is called a realization or a
sample sequence of the ensemble.

There are four possible interpretations of $x(n, \zeta)$, depending on the character of $n$ and
$\zeta$, as illustrated in Figure 3.7:


• $x(n, \zeta)$ is a random variable if $n$ is fixed and $\zeta$ is a variable.
• $x(n, \zeta)$ is a sample sequence if $\zeta$ is fixed and $n$ is a variable.
• $x(n, \zeta)$ is a number if both $n$ and $\zeta$ are fixed.
• $x(n, \zeta)$ is a stochastic process if both $n$ and $\zeta$ are variables.
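
The four interpretations listed above can be mimicked with a small data structure. The sketch below is a toy illustration and makes several assumptions that are not in the text: the sample space has only five outcomes, each sequence is truncated to eight samples instead of running over $-\infty < n < \infty$, and the rule that assigns a sequence to an outcome (a seeded Gaussian generator in the hypothetical function `realization`) is arbitrary.

```python
import random

def realization(zeta, length=8):
    """Deterministic sequence x(n, zeta) attached to outcome zeta (hypothetical rule)."""
    rng = random.Random(zeta)      # the outcome alone fixes the whole sequence
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

outcomes = range(5)                                      # a tiny sample space S
ensemble = {zeta: realization(zeta) for zeta in outcomes}

n0, zeta0 = 3, 2
random_variable = [ensemble[z][n0] for z in outcomes]    # n fixed, zeta varies
sample_sequence = ensemble[zeta0]                        # zeta fixed, n varies
number = ensemble[zeta0][n0]                             # both n and zeta fixed
# The stochastic process is the whole table `ensemble` (both n and zeta vary).

print(random_variable)
print(sample_sequence)
print(number)
```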


FIGURE 3.7
Graphical description of random sequences. (Panels: abstract space; real space showing the sequences $x(n, \zeta)$; random variable $x(n, \zeta)$ at a fixed time.)


A random sequence is also called a time series in the statistics literature. It is a sequence 
of random variables, or it can be thought of as an infinite-dimensional random vector. 
As with any infinite collection of objects, one has to be careful with the asymptotic (or
convergence) properties of a random sequence. If $n$ is a continuous variable taking values
in $\mathbb{R}$, then $x(n, \zeta)$ is an uncountable collection of random variables or an ensemble of
waveforms. This ensemble is called a continuous-time stochastic process or a random 
process. Although these processes can be handled similarly to sequences, they are more 
difficult to deal with in a rigorous mathematical manner than sequences are. Furthermore, 
practical signal processing requires discrete-time signals. Hence in this book we consider 
random sequences rather than random waveforms. 

Finally, in passing we note that the word stochastic is derived from the Greek word 
stochasticos, which means skillful in aiming or guessing. Hence, the terms random process 
and stochastic process will be used interchangeably throughout this book. 


As mentioned before, a deterministic signal is by definition exactly predictable. This 
assumes that there exists a certain functional relationship that completely describes the 
signal, even if this relationship is not available. The unpredictability of a random process 
is, in general, the combined result of two things. First, the selection of a single realization is 
based on the outcome of a random experiment. Second, no functional description is available
for all realizations of the ensemble. However, in some special cases, such a functional 
relationship is available. This means that after the occurrence of a specific realization, its 
future values can be predicted exactly from its past ones. If the future samples of any 
realization of a stochastic process can be predicted from the past ones, the process is called 
predictable or deterministic; otherwise, it is said to be a regular process. For example, 
the process $x(n, \zeta) = c$, where c is a random variable, is a predictable stochastic process
because every realization is a discrete-time signal with constant amplitude. In practice, we 
most often deal with regular stochastic processes. 

The simplest description of any random signal is provided by an amplitude-versus-time 
plot. Inspection of this plot provides qualitative information about some significant features 
of the signal that are useful in many applications. These features include, among others, the 
following: 


1. The frequency of occurrence of various signal amplitudes, described by the probability 
distribution of samples. 

2. The degree of dependence between two signal samples, described by the correlation 
between them. 

3. The existence of “cycles” or quasi-periodic patterns, obtained from the signal power 
spectrum (which will be described in Section 3.3.6). 

4. Indications of variability in the mean, variance, probability density, or spectral content. 


The first feature above, the amplitude distribution, is obtained by plotting the histogram, 
which is an estimate of the first-order probability density of the underlying stochastic pro- 
cess. The probability density indicates waveform features such as “spikiness” and bounded- 
ness. Its form is crucial in the design of reliable estimators, quantizers, and event detectors. 

The dependence between two signal samples (which are random variables) is given 
theoretically by the autocorrelation sequence and is quantified in practice by the empirical 
correlation (see Chapter 1), which is an estimate of the autocorrelation sequence of the 
underlying process. It affects the rate of amplitude change from sample to sample. 

Cycles in the data are related to sharp peaks in the power spectrum or periodicity in 
the autocorrelation. Although the power spectrum and the autocorrelation contain the same 
information, they present it in different fashions. 

Variability in a given quantity (e.g., variance) can be studied by evaluating this quantity 
for segments that can be assumed locally stationary and then analyzing the segment-to- 
segment variation. Such short-term descriptions should be distinguished from long-term 
ones, where the whole signal is analyzed as a single segment. 
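
The segment-based approach to variability described above can be sketched as follows. This is an illustration under assumptions: the test signal (Gaussian noise whose standard deviation jumps from 1 to 3 halfway through), the segment length, and the function name `segment_variances` are all hypothetical choices.

```python
import random
import statistics

def segment_variances(x, seg_len):
    """Variance of each non-overlapping segment of length seg_len."""
    return [statistics.pvariance(x[i:i + seg_len])
            for i in range(0, len(x) - seg_len + 1, seg_len)]

rng = random.Random(2)   # fixed seed (assumption) for repeatability
# Hypothetical signal: standard deviation 1 in the first half, 3 in the second.
x = ([rng.gauss(0.0, 1.0) for _ in range(2000)]
     + [rng.gauss(0.0, 3.0) for _ in range(2000)])

short_term = segment_variances(x, 500)   # eight local (short-term) estimates
long_term = statistics.pvariance(x)      # whole record treated as one segment

print([round(v, 2) for v in short_term])
print(round(long_term, 2))
```

The short-term values reveal the change in variance, whereas the single long-term value blends the two regimes into one number.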

All the above features, to a lesser or greater extent, are interrelated. Therefore, it is 
impossible to point out exactly the effect of each one upon the visual appearance of the signal. 
However, a lot of insight can be gained by introducing the concepts of signal variability 
and signal memory, which are discussed in Sections 3.3.5 and 3.4.3 respectively. 


3.3.1 Description Using Probability Functions 


From Figure 3.7, it is clear that at $n = n_0$, $x(n_0, \zeta)$ is a random variable that requires a
first-order probability function, say cdf $F_x(x; n_0)$, for its description. Similarly, $x(n_1, \zeta)$
and $x(n_2, \zeta)$ are joint random variables at instances $n_1$ and $n_2$, respectively, requiring a joint
cdf $F_x(x_1, x_2; n_1, n_2)$. Stochastic processes contain infinitely many such random variables.
Hence they are completely described, in a statistical sense, if their kth-order distribution
function
$$F_x(x_1, \ldots, x_k; n_1, \ldots, n_k) = \Pr\{x(n_1) \le x_1, \ldots, x(n_k) \le x_k\} \qquad (3.3.1)$$
is known for every value of $k \ge 1$ and for all instances $n_1, n_2, \ldots, n_k$. The kth-order pdf is
given by
$$f_x(x_1, \ldots, x_k; n_1, \ldots, n_k) = \frac{\partial^{2k} F_x(x_1, \ldots, x_k; n_1, \ldots, n_k)}{\partial x_{R1} \cdots \partial x_{Ik}} \qquad (3.3.2)$$

Clearly, the probabilistic description requires a lot of information that is difficult to 
obtain in practice except for simple stochastic processes. However, many (but not all) 
properties of a stochastic process can be described in terms of averages associated with its 
first- and second-order densities. 

For simplicity, in the rest of the book, we will use a compact notation x(n) to represent
either a random process $x(n, \zeta)$ or a single realization x(n), which is a member of the
ensemble. Thus we will drop the variable $\zeta$ from all notations involving random variables,
vectors, or processes. We believe that this will not cause any confusion and that the exact
meaning will be clear from the context. Also the random process x(n) is assumed to be
complex-valued unless explicitly specified as real-valued.


3.3.2 Second-Order Statistical Description 


The second-order statistic of x(n) at time n is specified by its mean value $\mu_x(n)$ and its
variance $\sigma_x^2(n)$, defined by

$$\mu_x(n) = E\{x(n)\} = E\{x_R(n) + j x_I(n)\} \qquad (3.3.3)$$
and
$$\sigma_x^2(n) = E\{|x(n) - \mu_x(n)|^2\} = E\{|x(n)|^2\} - |\mu_x(n)|^2 \qquad (3.3.4)$$
respectively. Note that both $\mu_x(n)$ and $\sigma_x(n)$ are, in general, deterministic sequences.

The second-order statistics of x(n) at two different times $n_1$ and $n_2$ are given by the
two-dimensional autocorrelation (or autocovariance) sequences. The autocorrelation sequence
of a discrete-time random process is defined as the joint moment of the random variables
$x(n_1)$ and $x(n_2)$, that is,
$$r_{xx}(n_1, n_2) = E\{x(n_1)x^*(n_2)\} \qquad (3.3.5)$$


It provides a measure of the dependence between values of the process at two different 
times. In this sense, it also provides information about the time variation of the process. 
The autocovariance sequence of x(n) is defined by

$$\begin{aligned}
\gamma_{xx}(n_1, n_2) &= E\{[x(n_1) - \mu_x(n_1)][x(n_2) - \mu_x(n_2)]^*\} \\
&= r_{xx}(n_1, n_2) - \mu_x(n_1)\mu_x^*(n_2)
\end{aligned} \qquad (3.3.6)$$

We will use notations such as $\gamma_x(n_1, n_2)$, $r_x(n_1, n_2)$, $\gamma(n_1, n_2)$, or $r(n_1, n_2)$ when there is
no confusion as to which signal we are referring. Note that, in general, the second-order
statistics are defined on a two-dimensional grid of integers.
The statistical relation between two stochastic processes x(n) and y(n) that are jointly
distributed (i.e., they are defined on the same sample space S) can be described by their
cross-correlation and cross-covariance functions, defined by

$$r_{xy}(n_1, n_2) = E\{x(n_1)y^*(n_2)\} \qquad (3.3.7)$$
and
$$\begin{aligned}
\gamma_{xy}(n_1, n_2) &= E\{[x(n_1) - \mu_x(n_1)][y(n_2) - \mu_y(n_2)]^*\} \\
&= r_{xy}(n_1, n_2) - \mu_x(n_1)\mu_y^*(n_2)
\end{aligned} \qquad (3.3.8)$$


The normalized cross-correlation of two random processes x(n) and y(n) is defined by

$$\rho_{xy}(n_1, n_2) = \frac{\gamma_{xy}(n_1, n_2)}{\sigma_x(n_1)\sigma_y(n_2)} \qquad (3.3.9)$$


Some definitions 
We now describe some useful types of stochastic processes based on their statistical 
properties. A random process is said to be 


• An independent process if
$$f_x(x_1, \ldots, x_k; n_1, \ldots, n_k) = f_1(x_1; n_1) \cdots f_k(x_k; n_k) \quad \forall k,\ n_i,\ i = 1, \ldots, k \qquad (3.3.10)$$
that is, x(n) is a sequence of independent random variables. If all random variables have
the same pdf f(x) for all k, then x(n) is called an IID (independent and identically
distributed) random sequence.

• An uncorrelated process if x(n) is a sequence of uncorrelated random variables, that is,
$$\gamma_x(n_1, n_2) = \begin{cases} \sigma_x^2(n_1) & n_1 = n_2 \\ 0 & n_1 \ne n_2 \end{cases} = \sigma_x^2(n_1)\,\delta(n_1 - n_2) \qquad (3.3.11)$$
Alternatively, we have
$$r_x(n_1, n_2) = \begin{cases} \sigma_x^2(n_1) + |\mu_x(n_1)|^2 & n_1 = n_2 \\ \mu_x(n_1)\mu_x^*(n_2) & n_1 \ne n_2 \end{cases} \qquad (3.3.12)$$


• An orthogonal process if it is a sequence of orthogonal random variables, that is,
$$r_x(n_1, n_2) = \begin{cases} \sigma_x^2(n_1) + |\mu_x(n_1)|^2 & n_1 = n_2 \\ 0 & n_1 \ne n_2 \end{cases} = E\{|x(n_1)|^2\}\,\delta(n_1 - n_2) \qquad (3.3.13)$$

• An independent increment process if $\forall k > 1$ and $\forall n_1 < n_2 < \cdots < n_k$, the increments
$$\{x(n_1)\},\ \{x(n_2) - x(n_1)\},\ \ldots,\ \{x(n_k) - x(n_{k-1})\}$$
are jointly independent. For such sequences, the kth-order probability function can be
constructed as products of the probability functions of its increments.

• A wide-sense periodic (WSP) process with period N if
$$\mu_x(n) = \mu_x(n + N) \quad \forall n \qquad (3.3.14)$$
and
$$r_x(n_1, n_2) = r_x(n_1 + N, n_2) = r_x(n_1, n_2 + N) = r_x(n_1 + N, n_2 + N) \qquad (3.3.15)$$
Note that in the above definition, $\mu_x(n)$ is periodic in one dimension while $r_x(n_1, n_2)$ is
periodic in two dimensions.
• A wide-sense cyclostationary process if there exists an integer N such that
$$\mu_x(n) = \mu_x(n + N) \quad \forall n \qquad (3.3.16)$$
and
$$r_x(n_1, n_2) = r_x(n_1 + N, n_2 + N) \qquad (3.3.17)$$
Note that in the above definition, $r_x(n_1, n_2)$ is not periodic in a two-dimensional sense.
The correlation sequence is invariant to shift by N in both of its arguments.

If all kth-order distributions of a stochastic process are jointly Gaussian, then it is called 
a Gaussian random sequence. 


We can also extend some of these definitions to the case of two joint stochastic processes. 
The random processes x(n) and y(n) are said to be


• Statistically independent if for all values of $n_1$ and $n_2$
$$f_{xy}(x, y; n_1, n_2) = f_x(x; n_1)\, f_y(y; n_2) \qquad (3.3.18)$$
• Uncorrelated if for every $n_1$ and $n_2$
$$\gamma_{xy}(n_1, n_2) = 0 \quad \text{or} \quad r_{xy}(n_1, n_2) = \mu_x(n_1)\mu_y^*(n_2) \qquad (3.3.19)$$
• Orthogonal if for every $n_1$ and $n_2$
$$r_{xy}(n_1, n_2) = 0 \qquad (3.3.20)$$
3.3.3 Stationarity


A random process x(n) is called stationary if statistics determined for x(n) are equal to 
those for x(n + k), for every k. More specifically, we have the following definition. 


DEFINITION 3.3 (STATIONARY OF ORDER N). A stochastic process x(n) is called stationary
of order N if

$$f_x(x_1, \ldots, x_N; n_1, \ldots, n_N) = f_x(x_1, \ldots, x_N; n_1 + k, \ldots, n_N + k) \qquad (3.3.21)$$

for any value of k. If x(n) is stationary for all orders $N = 1, 2, \ldots$, it is said to be strict-sense
stationary (SSS).


An IID sequence is SSS. However, SSS is more restrictive than necessary for most 
practical applications. A more relaxed form of stationarity, which is sufficient for practical 
problems, occurs when a random process is stationary up to order 2, and it is also known 
as wide-sense stationarity. 


DEFINITION 3.4 (WIDE-SENSE STATIONARITY). A random signal x(n) is called wide-sense
stationary (WSS) if

1. Its mean is a constant independent of n, that is,
$$E\{x(n)\} = \mu_x \qquad (3.3.22)$$
2. Its variance is also a constant independent of n, that is,
$$\operatorname{var}[x(n)] = \sigma_x^2 \qquad (3.3.23)$$
and
3. Its autocorrelation depends only on the distance $l = n_1 - n_2$, called lag, that is,
$$r_x(n_1, n_2) = r_x(n_1 - n_2) = r_x(l) = E\{x(n + l)x^*(n)\} = E\{x(n)x^*(n - l)\} \qquad (3.3.24)$$

From (3.3.22), (3.3.24), and (3.3.6) it follows that the autocovariance of a WSS signal
also depends only on $l = n_1 - n_2$, that is,
$$\gamma_x(l) = r_x(l) - |\mu_x|^2 \qquad (3.3.25)$$
EXAMPLE 3.3.1. Let w(n) be a zero-mean, uncorrelated Gaussian random sequence with variance
$\sigma_w^2(n) = 1$.
a. Characterize the random sequence w(n).
b. Define $x(n) = w(n) + w(n - 1)$, $-\infty < n < \infty$. Determine the mean and autocorrelation
of x(n). Also characterize x(n).

Solution. Note that the variance of w(n) is a constant.

a. Since uncorrelatedness implies independence for Gaussian random variables, w(n) is an
independent random sequence. Since its mean and variance are constants, it is at least stationary
in the first order. Furthermore, from (3.3.12) or (3.3.13) we have
$$r_w(n_1, n_2) = \sigma_w^2\,\delta(n_1 - n_2) = \delta(n_1 - n_2)$$
Hence w(n) is also a WSS random process.
b. The mean of x(n) is zero for all n since w(n) is a zero-mean process. Consider
$$\begin{aligned}
r_x(n_1, n_2) &= E\{x(n_1)x(n_2)\} \\
&= E\{[w(n_1) + w(n_1 - 1)][w(n_2) + w(n_2 - 1)]\} \\
&= r_w(n_1, n_2) + r_w(n_1, n_2 - 1) + r_w(n_1 - 1, n_2) + r_w(n_1 - 1, n_2 - 1) \\
&= \sigma_w^2\,\delta(n_1 - n_2) + \sigma_w^2\,\delta(n_1 - n_2 + 1) + \sigma_w^2\,\delta(n_1 - 1 - n_2) + \sigma_w^2\,\delta(n_1 - 1 - n_2 + 1) \\
&= 2\delta(n_1 - n_2) + \delta(n_1 - n_2 + 1) + \delta(n_1 - n_2 - 1)
\end{aligned}$$
Clearly, $r_x(n_1, n_2)$ is a function of $n_1 - n_2$. Hence
$$r_x(l) = 2\delta(l) + \delta(l + 1) + \delta(l - 1)$$
Therefore, x(n) is a WSS sequence. However, it is not an independent random sequence since
both x(n) and x(n + 1) depend on w(n).
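
The result of Example 3.3.1 can be checked by ensemble averaging. The sketch below is an illustration with assumed settings (seed, number of trials, and the choice of reference time n = 1). Each trial draws a fresh realization of w(0), ..., w(3), forms $x(n) = w(n) + w(n - 1)$, and accumulates the products $x(n + l)x(n)$ for lags l = 0, 1, 2, whose averages should approach 2, 1, and 0.

```python
import random

rng = random.Random(3)     # fixed seed (assumption) for repeatability
trials = 100_000           # number of ensemble members (arbitrary choice)
acc = [0.0, 0.0, 0.0]      # running sums for lags l = 0, 1, 2

for _ in range(trials):
    # One realization of w(0), ..., w(3): zero mean, unit variance, independent.
    w = [rng.gauss(0.0, 1.0) for _ in range(4)]
    # x(n) = w(n) + w(n - 1) for n = 1, 2, 3, stored under its own time index.
    x = {n: w[n] + w[n - 1] for n in range(1, 4)}
    for lag in range(3):
        acc[lag] += x[1 + lag] * x[1]   # x(n + l) x(n) at n = 1

r_hat = [s / trials for s in acc]       # estimates of r_x(0), r_x(1), r_x(2)
print(r_hat)
```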


EXAMPLE 3.3.2 (WIENER PROCESS). Toss a fair coin at each n, $-\infty < n < \infty$. Let

$$w(n) = \begin{cases} +S & \text{if heads is outcome} \quad \Pr(H) = 0.5 \\ -S & \text{if tails is outcome} \quad \Pr(T) = 0.5 \end{cases}$$

where S is a step size. Clearly, w(n) is an independent random process with
$$E\{w(n)\} = 0$$
and
$$E\{w^2(n)\} = \sigma_w^2 = S^2\left(\tfrac{1}{2}\right) + S^2\left(\tfrac{1}{2}\right) = S^2$$
Define a new random process x(n), $n \ge 1$, as
$$\begin{aligned}
x(1) &= w(1) \\
x(2) &= x(1) + w(2) = w(1) + w(2) \\
&\;\;\vdots \\
x(n) &= x(n - 1) + w(n) = \sum_{i=1}^{n} w(i)
\end{aligned}$$
Note that x(n) is a running sum of independent steps or increments; thus it is an independent
increment process. Such a sequence is called a discrete Wiener process or random walk. We can
easily see that
$$E\{x(n)\} = E\left\{\sum_{i=1}^{n} w(i)\right\} = 0$$
and
$$\begin{aligned}
E\{x^2(n)\} &= E\left\{\sum_{i=1}^{n} w(i) \sum_{k=1}^{n} w(k)\right\} = E\left\{\sum_{i=1}^{n}\sum_{k=1}^{n} w(i)w(k)\right\} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{n} E\{w(i)w(k)\} = \sum_{i=1}^{n} E\{w^2(i)\} = nS^2
\end{aligned}$$
Therefore, random walk is a nonstationary (or evolutionary) process with zero mean and variance
that grows with n, the number of steps taken.
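
The growth of the random-walk variance can be observed in simulation. The sketch below is an illustration only: the step size, the number of steps, the number of realizations, and the function name `random_walk` are assumed values chosen for a quick run. It averages x(n) and $x^2(n)$ over many independent walks and compares the second moment with $nS^2$.

```python
import random

def random_walk(n_steps, S, rng):
    """One realization x(1), ..., x(n_steps) of the coin-tossing random walk."""
    x, path = 0.0, []
    for _ in range(n_steps):
        x += S if rng.random() < 0.5 else -S   # fair coin: +S or -S
        path.append(x)
    return path

rng = random.Random(4)     # fixed seed (assumption) for repeatability
S, n_steps, trials = 1.0, 50, 10_000
sum_x = [0.0] * n_steps
sum_x2 = [0.0] * n_steps

for _ in range(trials):
    for i, v in enumerate(random_walk(n_steps, S, rng)):
        sum_x[i] += v
        sum_x2[i] += v * v

mean_hat = [s / trials for s in sum_x]             # estimate of E{x(n)}
second_moment_hat = [s / trials for s in sum_x2]   # estimate of E{x^2(n)}

for n in (10, 25, 50):
    # List index n - 1 holds the value for time n.
    print(n, mean_hat[n - 1], second_moment_hat[n - 1], n * S ** 2)
```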


It should be stressed at this point that although any strict-sense stationary signal is
wide-sense stationary, the converse is not always true, except if the signal is Gaussian. However,
in practice, it is very rare to encounter a signal that is stationary in the wide sense but not
stationary in the strict sense (Papoulis 1991).

Two random signals x(n) and y(n) are called jointly wide-sense stationary if each is
wide-sense stationary and their cross-correlation depends only on $l = n_1 - n_2$:

$$r_{xy}(l) = E\{x(n)y^*(n - l)\}; \qquad \gamma_{xy}(l) = r_{xy}(l) - \mu_x\mu_y^* \qquad (3.3.26)$$


Note that as a consequence of wide-sense stationarity the two-dimensional correlation and 
covariance sequences become one-dimensional sequences. This is a very important result 
that ultimately allows for a nice spectral description of stationary random processes. 


Properties of autocorrelation sequences 


The autocorrelation sequence of a stationary process has many important properties 
(which also apply to autocovariance sequences, but we will discuss mostly correlation 
sequences). Vector versions of these properties are discussed extensively in Section 3.4.4, 
and their proofs are explored in the problems. 
PROPERTY 3.3.1. The average power of a WSS process x(n) satisfies

$$r_x(0) = \sigma_x^2 + |\mu_x|^2 \ge 0 \qquad (3.3.27)$$
and
$$r_x(0) \ge |r_x(l)| \quad \text{for all } l \qquad (3.3.28)$$
Proof. See Problem 3.21 and Property 3.3.6.


This property implies that the correlation attains its maximum value at zero lag and
this value is nonnegative. The quantity $|\mu_x|^2$ is referred to as the average dc power, and the
quantity $\sigma_x^2 = \gamma_x(0)$ is referred to as the average ac power of the random sequence. The
quantity $r_x(0)$ then is the total average power of x(n).


PROPERTY 3.3.2. The autocorrelation sequence $r_x(l)$ is a conjugate symmetric function of lag
l, that is,

$$r_x^*(-l) = r_x(l) \qquad (3.3.29)$$
Proof. It follows from Definition 3.4 and from (3.3.24).


PROPERTY 3.3.3. The autocorrelation sequence $r_x(l)$ is nonnegative definite; that is, for any
$M > 0$ and any vector $\boldsymbol{\alpha} \in \mathbb{R}^M$

$$\sum_{k=1}^{M}\sum_{m=1}^{M} \alpha_k\, r_x(k - m)\, \alpha_m^* \ge 0 \qquad (3.3.30)$$

This is a necessary and sufficient condition for a sequence $r_x(l)$ to be the autocorrelation sequence
of a random sequence.


Proof. See Problem 3.22. 
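
The quadratic form in (3.3.30) is easy to evaluate numerically for short sequences. The sketch below is an illustration under assumptions: it uses $r_x(l) = 2\delta(l) + \delta(l + 1) + \delta(l - 1)$ from Example 3.3.1 as a valid autocorrelation, and a hypothetical sequence with $r(0) = r(\pm 1) = 1$ and zero elsewhere, which satisfies the bound (3.3.28) but fails (3.3.30). The function names and the random test vectors are arbitrary, and random trials can only fail to find a violation, not prove nonnegative definiteness.

```python
import random

def quadratic_form(r, alpha):
    """Sum over k, m of alpha[k] * r(k - m) * alpha[m] for a real vector alpha."""
    M = len(alpha)
    return sum(alpha[k] * r(k - m) * alpha[m] for k in range(M) for m in range(M))

def r_valid(l):
    """r_x(l) = 2 delta(l) + delta(l + 1) + delta(l - 1), from Example 3.3.1."""
    return {0: 2.0, 1: 1.0, -1: 1.0}.get(l, 0.0)

def r_invalid(l):
    """Hypothetical sequence with r(0) = r(+-1) = 1: bounded by r(0), yet not valid."""
    return {0: 1.0, 1: 1.0, -1: 1.0}.get(l, 0.0)

rng = random.Random(5)   # fixed seed (assumption) for repeatability
smallest = min(
    quadratic_form(r_valid, [rng.uniform(-1.0, 1.0) for _ in range(6)])
    for _ in range(1000)
)
counterexample = quadratic_form(r_invalid, [1.0, -1.0, 1.0])

print(smallest)         # nonnegative for every trial vector
print(counterexample)   # negative, so r_invalid cannot be an autocorrelation
```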


Since in this book we exclusively deal with wide-sense stationary processes, we will 
use the term stationary to mean wide-sense stationary. The properties of autocorrelation and 
cross-correlation sequences of jointly stationary processes, x(n) and y(n), are summarized 
in Table 3.1. 

Although SSS and WSS forms are widely used in practice, there are processes with 
different forms of stationarity. Consider the following example. 


EXAMPLE 3.3.3. Let x(n) be a real-valued random process generated by the system
$$x(n) = a\,x(n - 1) + w(n) \qquad n \ge 0 \qquad x(-1) = 0 \qquad (3.3.31)$$
where w(n) is a stationary random process with mean $\mu_w$ and $r_w(l) = \sigma_w^2\,\delta(l)$. The process
x(n) generated using (3.3.31) is known as a first-order autoregressive, or AR(1), process, and
the process w(n) is known as a white noise process (defined in Section 3.3.6). Determine the
mean $\mu_x(n)$ of x(n) and comment on its stationarity.

Footnote (on the AR(1) process): Note that from (3.3.31), x(n - 1) completely determines the distribution for x(n), and x(n) completely determines
the distribution for x(n + 1), and so on. If
$$f_{x(n)|x(n-1)\ldots}(x_n \mid x_{n-1}, \ldots) = f_{x(n)|x(n-1)}(x_n \mid x_{n-1})$$
then the process is termed a Markov process.

Solution. To compute the mean of x(n), we express it as a function of $\{w(n), w(n - 1), \ldots, w(0)\}$
as follows:
$$\begin{aligned}
x(0) &= a\,x(-1) + w(0) = w(0) \\
x(1) &= a\,x(0) + w(1) = a\,w(0) + w(1) \\
&\;\;\vdots \\
x(n) &= a^n w(0) + a^{n-1} w(1) + \cdots + w(n) = \sum_{k=0}^{n} a^k w(n - k)
\end{aligned}$$
Hence the mean of x(n) is given by
$$\mu_x(n) = E\left\{\sum_{k=0}^{n} a^k w(n - k)\right\} = \mu_w \sum_{k=0}^{n} a^k = \begin{cases} \dfrac{1 - a^{n+1}}{1 - a}\,\mu_w & a \ne 1 \\[2mm] (n + 1)\,\mu_w & a = 1 \end{cases}$$
Clearly, the mean of x(n) depends on n, and hence x(n) is nonstationary. However, if we assume
that $|a| < 1$ (which implies that the system is BIBO stable), then as $n \to \infty$, we obtain
$$\mu_x(n) = \mu_w\,\frac{1 - a^{n+1}}{1 - a} \;\xrightarrow[n \to \infty]{}\; \frac{\mu_w}{1 - a}$$
Thus x(n) approaches first-order stationarity for large n. Similar analysis for the autocorrelation
of x(n) shows that x(n) approaches wide-sense stationarity for large n (see Problem 3.23).
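
The closed-form mean in Example 3.3.3 can be cross-checked against a direct recursion. Taking expectations on both sides of (3.3.31) gives $\mu_x(n) = a\,\mu_x(n - 1) + \mu_w$ with $\mu_x(-1) = 0$; the sketch below iterates this recursion and compares it with the closed form. The parameter values $a = 0.9$ and $\mu_w = 1$ and the function names are illustrative assumptions.

```python
def ar1_mean_recursive(a, mu_w, n_max):
    """mu_x(n) for n = 0..n_max from mu_x(n) = a * mu_x(n - 1) + mu_w, mu_x(-1) = 0."""
    means, prev = [], 0.0
    for _ in range(n_max + 1):
        prev = a * prev + mu_w
        means.append(prev)
    return means

def ar1_mean_closed_form(a, mu_w, n):
    """Closed-form mean of x(n) from Example 3.3.3."""
    if a == 1:
        return (n + 1) * mu_w
    return (1 - a ** (n + 1)) / (1 - a) * mu_w

a, mu_w = 0.9, 1.0                    # assumed values with |a| < 1
means = ar1_mean_recursive(a, mu_w, 200)
limit = mu_w / (1 - a)                # asymptotic mean for |a| < 1

print(means[0], means[10], means[200], limit)
```

For $|a| < 1$ the gap between $\mu_x(n)$ and the limit shrinks as $a^{n+1}$, so the transient caused by the initial condition $x(-1) = 0$ dies out geometrically.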


The above example illustrates a form of stationarity called asymptotic stationarity. A
stochastic process x(n) is asymptotically stationary if the statistics of random variables
x(n) and x(n + k) become stationary as $k \to \infty$. When LTI systems are driven by
zero-mean uncorrelated-component random processes, the output process becomes asymptotically
stationary in the steady state. Another useful form of stationarity is given by stationary
increments. If the increments {x(n) — x(n — k)} of a process x(n) form a stationary process 
for every k, we say that x(n) is a process with stationary increments. Such processes can 
be used to model data in various practical applications (see Chapter 12). 

The simplest way to examine in practice whether a real-world signal is stationary is to
investigate the physical mechanism that produces the signal. If this mechanism is time-invariant,
then the signal is stationary. In case it is impossible to draw a conclusion based on physical 
considerations, we should rely on statistical methods (Bendat and Piersol 1986; Priestley 
1981). Note that stationarity in practice means that a random signal has statistical properties 
that do not change over the time interval we observe the signal. For evolutionary signals the 
statistical properties change continuously with time. Examples of highly nonstationary
random signals are the signals associated with the vibrations induced in space vehicles during
launch and reentry. However, there is a kind of random signal whose statistical properties 
change slowly with time. Such signals, which are stationary over short periods, are called 
locally stationary signals. Many signals of great practical interest, such as speech, EEG, 
and ECG, belong to this family of signals. 

Finally, we note that general techniques for the analysis of nonstationary signals do 
not exist. Thus only special methods that apply to specific types of nonstationary signals 
can be developed. Many such methods remove the nonstationary component of the signal, 
leaving behind another component that can be analyzed as stationary (Bendat and Piersol 
1986; Priestley 1981). 


3.3.4 Ergodicity 


A stochastic process consists of the ensemble and a probability law. If this information is 
available, the statistical properties of the process can be determined in a quite straightforward 
manner. However, in the real world, we have access to only a limited number (usually one) 
of realizations of the process. The question that arises then is, Can we infer the statistical 
characteristics of the process from a single realization? 

This is possible for the class of random processes that are called ergodic processes. 
Roughly speaking, ergodicity implies that all the statistical information can be obtained 
from any single representative member of the ensemble. 


Time averages 

All the statistical averages that we have defined up to this point are known as ensemble 
averages because they are obtained by “freezing” the time variable and averaging over the 
ensemble (see Fig. 3.7). Averages of this type are formally defined by using the expectation 
operator $E\{\cdot\}$. Ensemble averaging is not used frequently in practice, because it is
impractical to obtain the number of realizations needed for an accurate estimate. Thus the need for
a different kind of average, based on only one realization, naturally arises. Obviously such 
an average can be obtained only by time averaging. 

The time average of a quantity, related to a discrete-time random signal, is defined as

$$\langle (\cdot) \rangle \triangleq \lim_{N \to \infty} \frac{1}{2N + 1} \sum_{n=-N}^{N} (\cdot) \qquad (3.3.32)$$


Note that, owing to its dependence on a single realization, any time average is itself a random 
variable. The time average is taken over all time because all realizations of a stationary 
random process exist for all time; that is, they are power signals. 

For every ensemble average we can define a corresponding time average. The following 
time averages are of special interest: 


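$$\begin{aligned}
\text{Mean value} &= \langle x(n) \rangle \\
\text{Mean square} &= \langle |x(n)|^2 \rangle \\
\text{Variance} &= \langle |x(n) - \langle x(n) \rangle|^2 \rangle \\
\text{Autocorrelation} &= \langle x(n)x^*(n - l) \rangle \\
\text{Autocovariance} &= \langle [x(n) - \langle x(n) \rangle][x(n - l) - \langle x(n) \rangle]^* \rangle \\
\text{Cross-correlation} &= \langle x(n)y^*(n - l) \rangle \\
\text{Cross-covariance} &= \langle [x(n) - \langle x(n) \rangle][y(n - l) - \langle y(n) \rangle]^* \rangle
\end{aligned} \qquad (3.3.33)$$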

It is necessary to mention at this point the remarkable similarity between time averages 
and the correlation sequences for deterministic power signals. Although this is just a formal 
similarity, due to the fact that random signals are power signals, both quantities have the 
same properties. However, we should always keep in mind that although time averages
are random variables (because they are functions of $\zeta$), the corresponding quantities for
deterministic power signals are fixed numbers or deterministic sequences.


Ergodic random processes 


As we have already mentioned, in many practical applications only one realization of 
a random signal is available instead of the entire ensemble. In general, a single member of 
the ensemble does not provide information about the statistics of the process. However, if 
the process is stationary and ergodic, then all statistical information can be derived from 
only one typical realization of the process. 

A random signal x(n) is called ergodic if its ensemble averages equal appropriate time
averages. There are several degrees of ergodicity (Papoulis 1991). We will discuss two of
them: ergodicity in the mean and ergodicity in correlation.

Footnote: Strictly speaking, the form of ergodicity that we will use is called mean-square ergodicity since the underlying
convergence of random variables is in the mean-square sense (Stark and Woods 1994). Therefore, equalities in
the definitions are in the mean-square sense.


DEFINITION 3.5 (ERGODIC IN THE MEAN). A random process x(n) is ergodic in the mean
if
$$\langle x(n) \rangle = E\{x(n)\} \qquad (3.3.34)$$


DEFINITION 3.6 (ERGODIC IN CORRELATION). A random process x(n) is ergodic in
correlation if

$$\langle x(n)x^*(n - l) \rangle = E\{x(n)x^*(n - l)\} \qquad (3.3.35)$$

Note that since $\langle x(n) \rangle$ is constant and $\langle x(n)x^*(n - l) \rangle$ is a function of l, if x(n) is
ergodic in both the mean and correlation, then it is also WSS. Thus only stationary signals
can be ergodic. On the other hand, WSS does not imply ergodicity of any kind. Fortunately,
in practice almost all stationary processes are also ergodic, which is very useful for the
estimation of their statistical properties. From now on we will use the term ergodic to mean
both ergodicity in the mean and ergodicity in correlation.
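
A simple case of a WSS process that is not ergodic in the mean is the constant process $x(n, \zeta) = c$ mentioned earlier in this section, where c is a random variable. The sketch below is an illustration with assumed settings: c is taken to be zero-mean, unit-variance Gaussian, and the record length, number of realizations, and function name `constant_process` are arbitrary. Each time average equals the particular value of c drawn for that realization, whereas the ensemble average at a fixed time is close to $E\{c\} = 0$.

```python
import random

def constant_process(rng, length):
    """One realization of x(n) = c, with c drawn once per realization."""
    c = rng.gauss(0.0, 1.0)
    return [c] * length

rng = random.Random(6)   # fixed seed (assumption) for repeatability
realizations = [constant_process(rng, 200) for _ in range(2000)]

# Time average along each realization (one number per ensemble member).
time_averages = [sum(x) / len(x) for x in realizations]
# Ensemble average across realizations at the fixed time n = 0.
ensemble_average = sum(x[0] for x in realizations) / len(realizations)
spread = max(time_averages) - min(time_averages)

print(ensemble_average)   # close to E{c} = 0
print(spread)             # large: time averages differ between realizations
```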


DEFINITION 3.7 (JOINT ERGODICITY). Two random signals are called jointly ergodic if
they are individually ergodic and in addition

$$\langle x(n)y^*(n - l) \rangle = E\{x(n)y^*(n - l)\} \qquad (3.3.36)$$


A physical interpretation of ergodicity is that one realization of the random signal x(n),
as time n tends to infinity, takes on values with the same statistics as the value $x(n_0)$,
corresponding to all samples of the ensemble members at a given time $n = n_0$.

In practice, it is of course impossible to use the time-average formulas introduced 
above, because only finite records of data are available. In this case, it is common practice 
to replace the operator (3.3.32) by the operator

$$\langle (\cdot) \rangle_N = \frac{1}{2N + 1} \sum_{n=-N}^{N} (\cdot) \qquad (3.3.37)$$
to obtain estimates of the true quantities. Our desire in such problems is to find estimates 
that become increasingly accurate (in a sense to be defined in Section 3.6) as the length 
2N + 1 of the record of used data becomes larger. 
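
A finite-record time average in the spirit of (3.3.37) can be sketched as follows. This is an illustration under assumptions: the test signal is the process $x(n) = w(n) + w(n - 1)$ of Example 3.3.1, treated here as ergodic; the record holds 2N + 1 samples indexed from 0 and not from -N; and the lagged products are averaged over the number of available terms instead of 2N + 1, which is one of several possible normalizations. The function names are hypothetical.

```python
import random

def time_average_mean(x):
    """Finite-record time average of a real sequence."""
    return sum(x) / len(x)

def time_average_autocorr(x, lag):
    """Finite-record estimate of <x(n) x(n - lag)> for a real sequence, lag >= 0."""
    terms = [x[n] * x[n - lag] for n in range(lag, len(x))]
    return sum(terms) / len(terms)

rng = random.Random(7)   # fixed seed (assumption) for repeatability
N = 50_000
w = [rng.gauss(0.0, 1.0) for _ in range(2 * N + 2)]
x = [w[n] + w[n - 1] for n in range(1, 2 * N + 2)]   # 2N + 1 samples of x(n)

mean_hat = time_average_mean(x)
r_hat = [time_average_autocorr(x, lag) for lag in range(3)]

print(mean_hat)   # close to 0
print(r_hat)      # close to [2, 1, 0], the ensemble values from Example 3.3.1
```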

Finally, to summarize, we note that whereas stationarity ensures the time invariance 
of the statistics of a random signal, ergodicity implies that any statistics can be calculated 
either by averaging over all members of the ensemble at a fixed time or by time-averaging 
over any single representative member of the ensemble. 


3.3.5 Random Signal Variability 


If we consider a stationary random sequence w(n) that is IID with zero mean, its key charac- 
teristics depend on its first-order density. Figure 3.8 shows the probability density functions 
and sample realizations for IID processes with uniform, Gaussian, and Cauchy probability 
distributions. In the case of the uniform distribution, the amplitude of the random variable is 
limited to a range, with values occurring outside this interval with zero probability. On the 
other hand, the Gaussian distribution does not have a finite interval of support, allowing for 
the possibility of any value. The same is true of the Cauchy distribution, but its characteris- 
tics are dramatically different from those of the Gaussian distribution. The center lobe of the 
density is much narrower while the tails that extend out to infinity are significantly higher. 
As a result, the realization of the Cauchy random process contains numerous spikes or ex- 
treme values while the remainder of the process is more compact about the mean. Although 
the Gaussian random process allows for the possibility of large values, the probability of 
their occurrence is so small that they are not found in realizations of the process. 

The major difference between the Gaussian and Cauchy distributions lies in the area 
found under the tails of the density as it extends out to infinity. This characteristic is related 
to the variability of the process. The heavy tails, as found in the Cauchy distribution, result 
in an abundance of spikes in the process, a characteristic referred to as high variability. On 
the other hand, a distribution such as the Gaussian does not allow for extreme values and 
indicates low variability. The extent of the variability of a given distribution is determined by 
the heaviness of the tails. Distributions with heavy tails are called long-tailed distributions 
and have been used extensively as models of impulsive random processes. 
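
A minimal sketch of this contrast, under assumed settings (10 000 samples, a threshold of 5, a fixed seed), counts how often IID Gaussian and IID Cauchy samples exceed the threshold in magnitude. The Cauchy samples are drawn with the inverse-CDF method; all names are illustrative.

```python
import math
import random

def cauchy_sample(rng):
    """Standard Cauchy variate via the inverse CDF: x = tan(pi * (u - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def count_exceedances(samples, threshold):
    """Number of samples whose magnitude exceeds the threshold."""
    return sum(1 for x in samples if abs(x) > threshold)

rng = random.Random(1)
num_samples = 10000
gaussian = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
cauchy = [cauchy_sample(rng) for _ in range(num_samples)]

threshold = 5.0
gauss_count = count_exceedances(gaussian, threshold)
cauchy_count = count_exceedances(cauchy, threshold)
print(f"|x| > {threshold}: Gaussian {gauss_count}, Cauchy {cauchy_count}")
```

For the standard Cauchy density the probability of exceeding 5 in magnitude is about 0.13, whereas for the unit-variance Gaussian it is below $10^{-6}$, so the Cauchy count should be in the hundreds or more while the Gaussian count is almost always zero.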


DEFINITION 3.8. A distribution is called long-tailed if its tails decay hyperbolically or
algebraically as

$$\Pr\{|x(n)| \ge x\} \sim C x^{-\alpha} \qquad \text{as } x \to \infty \tag{3.3.38}$$


where $C$ is a constant and the variable $\alpha$ determines the rate of decay of the distribution.

[Figure 3.8: for each of the uniform, Gaussian, and Cauchy distributions, a plot of the
probability density function next to a sample sequence of 1000 points.]

FIGURE 3.8
Probability density functions and sample realizations of an IID process with
uniform, Gaussian, and Cauchy distributions.


By way of comparison, the Gaussian distribution has an exponential rate of decay.
The implication of the algebraically decaying tail is that the process has infinite variance,
that is,

$$\sigma_x^2 = E\{|x(n)|^2\} = \infty$$


and therefore lacks second-order moments. The lack of second-order moments means that, in 
addition to the variance, the correlation functions of these processes do not exist. Since most 
signal processing algorithms are based on second-order moment theory, infinite variance 
has some extreme implications for the way in which such processes are treated. 

In this book, we shall model high variability, and hence infinite variance, using the 
family of symmetric stable distributions. The reason is twofold: First, a linear combination of 
stable random variables is stable. Second, stable distributions appear as limits in central limit 
theorems (see stable distributions in Section 3.2.4). Stable distributions are characterized
by a parameter $\alpha$, $0 < \alpha \le 2$. They are Cauchy when $\alpha = 1$ and Gaussian when $\alpha = 2$.
However, they have finite variance only when $\alpha = 2$.

In practice, the type of data under consideration governs the variability of the modeling 
distribution. Random signals restricted to a certain interval, such as the phase of complex 
random signals, are well suited for uniform distributions. On the other hand, signals allowing 
for any possible value but generally confined to a region are better suited for Gaussian 
models. However, if a process contains spikes and therefore has high variability, it is best
characterized by a long-tailed distribution such as the Cauchy distribution. Impulsive signals
have been found in a variety of applications, such as communication channels, radar signals, 
and electronic circuit noise. In all cases, the variability of the process dictates the appropriate 
model. 


3.3.6 Frequency-Domain Description of Stationary Processes 


Discrete-time stationary random processes have correlation sequences that are functions of 
a single index. This leads to nice and powerful representations in both the frequency and 
the z-transform domains. 


Power spectral density 


The power spectral density (PSD, or more appropriately autoPSD) of a stationary
stochastic process $x(n)$ is the Fourier transform of its autocorrelation sequence $r_x(l)$.
If $r_x(l)$ is periodic in $l$ (which corresponds to a wide-sense periodic stochastic process),
then the DTFS discussed in Section 2.2.1 can be used to obtain the PSD, which has the
form of a line spectrum. If $r_x(l)$ is nonperiodic, the DTFT discussed in Section 2.2.1 can
be used provided that $r_x(l)$ is absolutely summable. This means that the process $x(n)$ must
be a zero-mean process. In general, a stochastic process can be a mixture of periodic and
nonperiodic components.†

If we allow impulse functions in the DTFT to represent periodic (or almost periodic) 
sequences and non-zero-mean processes (see Section 2.2.1), then we can define the PSD as 

$$R_x(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_x(l)\, e^{-j\omega l} \tag{3.3.39}$$

where $\omega$ is the frequency in radians per sample. If the process $x(n)$ is a zero-mean
nonperiodic process, then (3.3.39) is enough to determine the PSD. If $x(n)$ is periodic
(including nonzero mean) or almost periodic, then the PSD is given by


$$R_x(e^{j\omega}) = \sum_{i} 2\pi A_i\, \delta(\omega - \omega_i) \tag{3.3.40}$$

where the $A_i$ are amplitudes of $r_x(l)$ at frequencies $\omega_i$. For discussion purposes we will
assume that $x(n)$ is a zero-mean nonperiodic process. The autocorrelation $r_x(l)$ can be
recovered from the PSD by using the inverse DTFT as


$$r_x(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R_x(e^{j\omega})\, e^{j\omega l}\, d\omega \tag{3.3.41}$$


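EXAMPLE 3.3.4. Determine the PSD of a zero-mean WSS process $x(n)$ with $r_x(l) = a^{|l|}$,
$-1 < a < 1$.


Solution. From (3.3.39) we have

$$\begin{aligned}
R_x(e^{j\omega}) &= \sum_{l=-\infty}^{\infty} a^{|l|} e^{-j\omega l} \qquad -1 < a < 1 \\
&= \frac{1}{1 - a e^{j\omega}} + \frac{1}{1 - a e^{-j\omega}} - 1 \\
&= \frac{1 - a^2}{1 + a^2 - 2a\cos\omega} \qquad -1 < a < 1
\end{aligned} \tag{3.3.42}$$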


which is a real-valued, even, and nonnegative function of w. 
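
The closed form in (3.3.42) can be checked numerically against a truncated version of the sum in (3.3.39). The sketch below is only a sanity check under assumed values ($a = 0.8$, truncation at $|l| \le 200$); the function names are illustrative.

```python
import cmath
import math

def psd_truncated_sum(a, omega, max_lag=200):
    """Truncated DTFT of r_x(l) = a^|l| over lags -max_lag..max_lag."""
    total = 0.0 + 0.0j
    for l in range(-max_lag, max_lag + 1):
        total += a ** abs(l) * cmath.exp(-1j * omega * l)
    return total

def psd_closed_form(a, omega):
    """Closed-form PSD (1 - a^2) / (1 + a^2 - 2 a cos(omega))."""
    return (1 - a * a) / (1 + a * a - 2 * a * math.cos(omega))

a = 0.8
for omega in (0.0, 0.5, math.pi / 2, math.pi):
    numeric = psd_truncated_sum(a, omega)
    closed = psd_closed_form(a, omega)
    print(f"omega = {omega:.4f}  sum = {numeric.real:.6f}  closed form = {closed:.6f}")
```

The imaginary part of the truncated sum should vanish up to rounding, and its value should be the same at $\pm\omega$, in line with the remark that the PSD is real-valued and even.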


Properties of the autoPSD. The power spectral density $R_x(e^{j\omega})$ has three key
properties that follow from corresponding properties of the autocorrelation sequence and the
DTFT.


†Periodic components are predictable processes as discussed before. However, some nonperiodic components can
also be predictable. Hence nonperiodic components are not always regular processes.

PROPERTY 3.3.4. The autoPSD $R_x(e^{j\omega})$ is a real-valued periodic function of frequency with
period $2\pi$ for any (real- or complex-valued) process $x(n)$. If $x(n)$ is real-valued, then $R_x(e^{j\omega})$
is also an even function of $\omega$, that is,

$$R_x(e^{-j\omega}) = R_x(e^{j\omega}) \tag{3.3.43}$$

Proof. It follows from autocorrelation and DTFT properties.

PROPERTY 3.3.5. The autoPSD is nonnegative definite, that is,

$$R_x(e^{j\omega}) \ge 0 \tag{3.3.44}$$


Proof. This follows from the nonnegative definiteness of the autocorrelation sequence [see also 
discussions leading to (3.4.27)]. 


PROPERTY 3.3.6. The area under $R_x(e^{j\omega})$ is nonnegative and it equals the average power of
$x(n)$. Indeed, from (3.3.41) it follows with $l = 0$ that

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} R_x(e^{j\omega})\, d\omega = r_x(0) = E\{|x(n)|^2\} \ge 0 \tag{3.3.45}$$

Proof. It follows from Property 3.3.5.


White noise. A random sequence $w(n)$ is called a (second-order) white noise process
with mean $\mu_w$ and variance $\sigma_w^2$, denoted by

$$w(n) \sim \text{WN}(\mu_w, \sigma_w^2) \tag{3.3.46}$$

if and only if $E\{w(n)\} = \mu_w$ and

$$r_w(l) = E\{w(n)\, w^*(n-l)\} = \sigma_w^2\, \delta(l) \tag{3.3.47}$$

which implies that

$$R_w(e^{j\omega}) = \sigma_w^2 \qquad -\pi \le \omega \le \pi \tag{3.3.48}$$


The term white noise is used to emphasize that all frequencies contribute the same amount 
of power, as in the case of white light, which is obtained by mixing all possible colors by 
the same amount. If, in addition, the pdf of $w(n)$ is Gaussian, then the process is called a
(second-order) white Gaussian noise process, and it will be denoted by $\text{WGN}(\mu_w, \sigma_w^2)$.

If the random variables $w(n)$ are independently and identically distributed with mean
$\mu_w$ and variance $\sigma_w^2$, then we shall write

$$w(n) \sim \text{IID}(\mu_w, \sigma_w^2) \tag{3.3.49}$$


This is sometimes referred to as a strict white noise. 

We emphasize that the conditions of uncorrelatedness or independence do not put any 
restriction on the form of the probability density function of w(n). Thus we can have an 
IID process with any type of probability distribution. Clearly, white noise is the simplest 
random process because it does not have any structure. However, we will see that it can be 
used as the basic building block for the construction of processes with more complicated 
dependence or correlation structures. 
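
To make (3.3.47) concrete, the sketch below estimates the autocorrelation of one zero-mean IID Gaussian record by time averaging and compares it with $\sigma_w^2\,\delta(l)$. The record length, seed, $\sigma_w = 2$, and the helper name are assumptions made for this illustration; any other zero-mean IID density with the same variance should give a similar result.

```python
import random

def sample_autocorrelation(x, lag):
    """Time-average estimate of r(l) = E{x(n) x(n - l)} for a real record, lag >= 0."""
    n_terms = len(x) - lag
    return sum(x[n] * x[n - lag] for n in range(lag, len(x))) / n_terms

rng = random.Random(7)
sigma = 2.0
w = [rng.gauss(0.0, sigma) for _ in range(50000)]

r_hat = [sample_autocorrelation(w, lag) for lag in range(4)]
for lag, value in enumerate(r_hat):
    expected = sigma ** 2 if lag == 0 else 0.0
    print(f"lag {lag}: estimate = {value:+.4f}  sigma_w^2 * delta(l) = {expected:.1f}")
```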


Harmonic processes. A harmonic process is defined by 


$$x(n) = \sum_{k=1}^{M} A_k \cos(\omega_k n + \phi_k) \qquad \omega_k \ne 0 \tag{3.3.50}$$

where $M$, $\{A_k\}_1^M$, and $\{\omega_k\}_1^M$ are constants and $\{\phi_k\}_1^M$ are pairwise independent random
variables uniformly distributed in the interval $[0, 2\pi]$. It can be shown (see Problem 3.9)
that $x(n)$ is a stationary process with mean


$$E\{x(n)\} = 0 \qquad \text{for all } n \tag{3.3.51}$$

and autocorrelation

$$r_x(l) = \frac{1}{2} \sum_{k=1}^{M} A_k^2 \cos \omega_k l \qquad -\infty < l < \infty \tag{3.3.52}$$

We note that $r_x(l)$ consists of a sum of “in-phase” cosines with the same frequencies as in
$x(n)$.

If $\omega_k/(2\pi)$ are rational numbers, $r_x(l)$ is periodic and can be expanded as a Fourier
series. These series coefficients provide the power spectrum $R_x(k)$ of $x(n)$. However, because
$r_x(l)$ is a linear superposition of cosines, it always has a line spectrum with $2M$ lines of
strength $A_k^2/4$ at frequencies $\pm\omega_k$ in the interval $[-\pi, \pi]$. If $r_x(l)$ is periodic, then the lines
are equidistant (i.e., harmonically related), hence the name harmonic process. If $\omega_k/(2\pi)$
is irrational, then $r_x(l)$ is almost periodic and can be treated in the frequency domain in
almost the same fashion. Hence the power spectrum of a harmonic process is given by

$$R_x(e^{j\omega}) = \sum_{k=-M}^{M} 2\pi \left(\frac{A_k^2}{4}\right) \delta(\omega - \omega_k) = \frac{\pi}{2} \sum_{k=-M}^{M} A_k^2\, \delta(\omega - \omega_k) \qquad -\pi < \omega \le \pi \tag{3.3.53}$$

where the sums are understood with $A_{-k} = A_k$, $\omega_{-k} = -\omega_k$, and no $k = 0$ term, so that
they contain the $2M$ lines at $\pm\omega_k$.

EXAMPLE 3.3.5. Consider the following harmonic process 
$$x(n) = \cos(0.1\pi n + \phi_1) + 2\sin(1.5 n + \phi_2)$$

where $\phi_1$ and $\phi_2$ are IID random variables uniformly distributed in the interval $[0, 2\pi]$. The
first component of $x(n)$ is periodic with $\omega_1 = 0.1\pi$ and period equal to 20 while the second
component is almost periodic with $\omega_2 = 1.5$. Thus the sequence $x(n)$ is almost periodic. A
sample function realization of $x(n)$ is shown in Figure 3.9(a). The mean of $x(n)$ is

$$\mu_x(n) = E\{x(n)\} = E\{\cos(0.1\pi n + \phi_1) + 2\sin(1.5 n + \phi_2)\} = 0$$

and the autocorrelation sequence (using mutual independence between $\phi_1$ and $\phi_2$) is

$$\begin{aligned}
r_x(n_1, n_2) &= E\{x(n_1)\, x^*(n_2)\} \\
&= E\{\cos(0.1\pi n_1 + \phi_1) \cos(0.1\pi n_2 + \phi_1)\} \\
&\quad + E\{2\sin(1.5 n_1 + \phi_2)\, 2\sin(1.5 n_2 + \phi_2)\} \\
&= \tfrac{1}{2} \cos[0.1\pi (n_1 - n_2)] + 2\cos[1.5 (n_1 - n_2)]
\end{aligned}$$

or

$$r_x(l) = \tfrac{1}{2} \cos 0.1\pi l + 2\cos 1.5 l \qquad l = n_1 - n_2$$

Thus the line spectrum $R_x^{(k)}$ is given by

$$R_x^{(k)} = \begin{cases}
1 & \omega_{-2} = -1.5 \\
\tfrac{1}{4} & \omega_{-1} = -0.1\pi \\
\tfrac{1}{4} & \omega_{1} = 0.1\pi \\
1 & \omega_{2} = 1.5
\end{cases}$$

and the power spectrum $R_x(e^{j\omega})$ is given by

$$R_x(e^{j\omega}) = 2\pi\,\delta(\omega + 1.5) + \frac{\pi}{2}\,\delta(\omega + 0.1\pi) + \frac{\pi}{2}\,\delta(\omega - 0.1\pi) + 2\pi\,\delta(\omega - 1.5)$$

The line spectrum of x(n) is shown in Figure 3.9(b) and the corresponding power spectrum in 
Figure 3.9(c). 
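
The autocorrelation derived in Example 3.3.5 can be checked by ensemble averaging over many independent draws of the two phases. This is a Monte Carlo sketch under assumed settings (40 000 trials, fixed seed); the function names are illustrative.

```python
import math
import random

def realization_pair(n1, n2, rng):
    """Draw one pair of phases and return (x(n1), x(n2)) for that realization."""
    phi1 = rng.uniform(0.0, 2 * math.pi)
    phi2 = rng.uniform(0.0, 2 * math.pi)

    def x(n):
        return math.cos(0.1 * math.pi * n + phi1) + 2 * math.sin(1.5 * n + phi2)

    return x(n1), x(n2)

def ensemble_autocorrelation(n1, n2, trials=40000, seed=3):
    """Monte Carlo estimate of E{x(n1) x(n2)} over the random phases."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x1, x2 = realization_pair(n1, n2, rng)
        total += x1 * x2
    return total / trials

def theoretical_autocorrelation(lag):
    """r_x(l) = (1/2) cos(0.1 pi l) + 2 cos(1.5 l) from the example."""
    return 0.5 * math.cos(0.1 * math.pi * lag) + 2.0 * math.cos(1.5 * lag)

for n1, n2 in ((7, 4), (17, 14), (5, 5)):
    estimate = ensemble_autocorrelation(n1, n2)
    theory = theoretical_autocorrelation(n1 - n2)
    print(f"n1 = {n1:2d}, n2 = {n2:2d}: estimate = {estimate:+.3f}  theory = {theory:+.3f}")
```

The pairs (7, 4) and (17, 14) share the lag $l = 3$, so their estimates should agree up to Monte Carlo error, which illustrates that the autocorrelation depends only on $n_1 - n_2$.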


The harmonic process is predictable because any given realization is a sinusoidal se- 
quence with fixed amplitude, frequency, and phase. We stress that the independence of the 
phases is required to guarantee the stationarity of x(n) in (3.3.50). The uniform distribution 
of the phases is necessary to make $x(n)$ a stationary process (see Problem 3.9). The harmonic
process (3.3.50), in general, is non-Gaussian; however, it becomes Gaussian if the
amplitudes $A_k$ are random variables with a Rayleigh distribution (Porat 1994).

FIGURE 3.9
The time and frequency-domain description of the harmonic process in Example 3.3.5.


EXAMPLE 3.3.6. Consider a complex-valued process given by

$$x(n) = A e^{j\omega_0 n} = |A|\, e^{j(\omega_0 n + \phi)}$$

where $A$ is a complex-valued random variable and $\omega_0$ is constant. The mean of $x(n)$,

$$E\{x(n)\} = E\{A\}\, e^{j\omega_0 n}$$

can be constant only if $E\{A\} = 0$. If $|A|$ is constant and $\phi$ is uniformly distributed on $[0, 2\pi]$,
then we have $E\{A\} = |A|\, E\{e^{j\phi}\} = 0$. In this case the autocorrelation is

$$r_x(n_1, n_2) = E\{|A|\, e^{j(\omega_0 n_1 + \phi)}\, |A|\, e^{-j(\omega_0 n_2 + \phi)}\} = |A|^2 e^{j(n_1 - n_2)\omega_0}$$

Since the mean is constant and the autocorrelation depends only on the difference $l = n_1 - n_2$, the
process is wide-sense stationary.
The above example can be generalized to harmonic processes of the form

$$x(n) = \sum_{k=1}^{M} A_k\, e^{j(\omega_k n + \phi_k)} \tag{3.3.54}$$

where $M$, $\{A_k\}_1^M$, and $\{\omega_k\}_1^M$ are constants and $\{\phi_k\}_1^M$ are pairwise independent random
variables uniformly distributed in the interval $[0, 2\pi]$. The autocorrelation sequence is

$$r_x(l) = \sum_{k=1}^{M} |A_k|^2 e^{j\omega_k l} \tag{3.3.55}$$

and the power spectrum consists of $M$ impulses with amplitudes $2\pi|A_k|^2$ at frequencies
$\omega_k$. If the amplitudes $\{A_k\}_1^M$ are random variables, mutually independent of the random
phases, the quantity $|A_k|^2$ is replaced by $E\{|A_k|^2\}$.


Cross-power spectral density 


The cross-power spectral density of two zero-mean and jointly stationary stochastic 
processes provides a description of their statistical relations in the frequency domain and is 
defined as the DTFT of their cross-correlation, that is, 


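$$R_{xy}(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_{xy}(l)\, e^{-j\omega l} \tag{3.3.56}$$

The cross-correlation $r_{xy}(l)$ can be recovered by the inverse DTFT

$$r_{xy}(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R_{xy}(e^{j\omega})\, e^{j\omega l}\, d\omega \tag{3.3.57}$$
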

The cross-spectrum $R_{xy}(e^{j\omega})$ is, in general, a complex function of $\omega$. From $r_{xy}(l) = r_{yx}^*(-l)$
it follows that

$$R_{xy}(e^{j\omega}) = R_{yx}^*(e^{j\omega}) \tag{3.3.58}$$

This implies that $R_{xy}(e^{j\omega})$ and $R_{yx}(e^{j\omega})$ have the same magnitude but opposite phase.
The normalized cross-spectrum 


$$G_{xy}(e^{j\omega}) \triangleq \frac{R_{xy}(e^{j\omega})}{\sqrt{R_x(e^{j\omega})}\, \sqrt{R_y(e^{j\omega})}} \tag{3.3.59}$$

is called the coherence function. Its squared magnitude

$$|G_{xy}(e^{j\omega})|^2 = \frac{|R_{xy}(e^{j\omega})|^2}{R_x(e^{j\omega})\, R_y(e^{j\omega})} \tag{3.3.60}$$

is known as the magnitude square coherence (MSC) and can be thought of as a sort of
correlation coefficient in the frequency domain. If $x(n) = y(n)$, then $G_{xy}(e^{j\omega}) = 1$
(maximum correlation), whereas if $x(n)$ and $y(n)$ are uncorrelated, then $r_{xy}(l) = 0$ and hence
$G_{xy}(e^{j\omega}) = 0$. In other words, $0 \le |G_{xy}(e^{j\omega})| \le 1$.
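
As a short worked illustration of values between these two extremes, consider the assumed model $y(n) = x(n) + v(n)$, where $v(n)$ is a zero-mean stationary noise uncorrelated with the zero-mean process $x(n)$. The model and the symbols $v(n)$, $r_v(l)$, and $R_v(e^{j\omega})$ are introduced only for this example.

```latex
\begin{align*}
r_{xy}(l) &= E\{x(n)[x(n-l) + v(n-l)]^*\} = r_x(l)
  &&\Rightarrow\; R_{xy}(e^{j\omega}) = R_x(e^{j\omega}) \\
r_y(l) &= r_x(l) + r_v(l)
  &&\Rightarrow\; R_y(e^{j\omega}) = R_x(e^{j\omega}) + R_v(e^{j\omega}) \\
|G_{xy}(e^{j\omega})|^2
  &= \frac{R_x^2(e^{j\omega})}{R_x(e^{j\omega})\,[R_x(e^{j\omega}) + R_v(e^{j\omega})]}
   = \frac{R_x(e^{j\omega})}{R_x(e^{j\omega}) + R_v(e^{j\omega})}
\end{align*}
```

Under these assumptions the MSC approaches 1 at frequencies where the signal dominates the noise and approaches 0 where the noise dominates.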


Complex spectral density functions 


If the sequences $r_x(l)$ and $r_{xy}(l)$ are absolutely summable within a certain ring of the
complex $z$ plane, we can obtain their $z$-transforms

$$R_x(z) = \sum_{l=-\infty}^{\infty} r_x(l)\, z^{-l} \tag{3.3.61}$$

$$R_{xy}(z) = \sum_{l=-\infty}^{\infty} r_{xy}(l)\, z^{-l} \tag{3.3.62}$$

which are known as the complex spectral density and complex cross-spectral density
functions, respectively. If the unit circle, defined by $z = e^{j\omega}$, is within the region of convergence
of the above summations, then

$$R_x(e^{j\omega}) = R_x(z)\big|_{z=e^{j\omega}} \tag{3.3.63}$$

$$R_{xy}(e^{j\omega}) = R_{xy}(z)\big|_{z=e^{j\omega}} \tag{3.3.64}$$


The correlation and power spectral density properties of random sequences are summarized 
in Table 3.1. 


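EXAMPLE 3.3.7. Consider the random sequence given in Example 3.3.4 with autoPSD in (3.3.42)

$$R_x(e^{j\omega}) = \frac{1 - a^2}{1 + a^2 - 2a\cos\omega} \qquad |a| < 1$$

Determine the complex autoPSD $R_x(z)$.

Solution. The complex autoPSD is given by $R_x(z) = R_x(e^{j\omega})\big|_{e^{j\omega}=z}$. Since

$$\cos\omega = \frac{e^{j\omega} + e^{-j\omega}}{2} = \left.\frac{z + z^{-1}}{2}\right|_{z=e^{j\omega}}$$

we obtain

$$R_x(z) = \frac{1 - a^2}{1 + a^2 - 2a\left(\dfrac{z + z^{-1}}{2}\right)} = \frac{(a - a^{-1})\, z^{-1}}{1 - (a + a^{-1})\, z^{-1} + z^{-2}} \qquad |a| < |z| < \frac{1}{|a|}$$

Now the inverse $z$-transform of $R_x(z)$ determines the autocorrelation sequence $r_x(l)$, that is,

$$\begin{aligned}
R_x(z) &= \frac{(a - a^{-1})\, z^{-1}}{1 - (a + a^{-1})\, z^{-1} + z^{-2}} = \frac{(a - a^{-1})\, z^{-1}}{(1 - a z^{-1})(1 - a^{-1} z^{-1})} \\
&= \frac{1}{1 - a z^{-1}} - \frac{1}{1 - a^{-1} z^{-1}} \qquad |a| < |z| < |a|^{-1}
\end{aligned}$$

or

$$r_x(l) = a^{l} u(l) + (a^{-1})^{l} u(-l - 1) = a^{|l|} \tag{3.3.65}$$
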


This approach can be used to determine autocorrelation sequences from autoPSD functions. 
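
The same autocorrelation can also be recovered numerically from the autoPSD with the inverse DTFT (3.3.41). The sketch below approximates the integral by a Riemann sum over one period; the number of grid points, $a = 0.5$, and the function names are assumptions of this illustration.

```python
import cmath
import math

def psd(a, omega):
    """AutoPSD (1 - a^2) / (1 + a^2 - 2 a cos(omega)) of r_x(l) = a^|l|."""
    return (1 - a * a) / (1 + a * a - 2 * a * math.cos(omega))

def autocorrelation_from_psd(a, lag, num_points=512):
    """Approximate (1/(2 pi)) * integral of R_x(e^{jw}) e^{jwl} dw by a Riemann sum."""
    total = 0.0 + 0.0j
    for k in range(num_points):
        omega = -math.pi + 2 * math.pi * k / num_points
        total += psd(a, omega) * cmath.exp(1j * omega * lag)
    return (total / num_points).real

a = 0.5
for lag in (-3, 0, 2, 5):
    print(f"l = {lag:+d}: numeric = {autocorrelation_from_psd(a, lag):.6f}  a^|l| = {a ** abs(lag):.6f}")
```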


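Table 3.1 provides a summary of correlation and spectral properties of stationary
random sequences.


TABLE 3.1
Summary of correlation and spectral properties of stationary random sequences.

Definitions

| Quantity | Definition |
|---|---|
| Mean value | $\mu_x = E\{x(n)\}$ |
| Autocorrelation | $r_x(l) = E\{x(n)\, x^*(n-l)\}$ |
| Autocovariance | $\gamma_x(l) = E\{[x(n) - \mu_x][x(n-l) - \mu_x]^*\}$ |
| Cross-correlation | $r_{xy}(l) = E\{x(n)\, y^*(n-l)\}$ |
| Cross-covariance | $\gamma_{xy}(l) = E\{[x(n) - \mu_x][y(n-l) - \mu_y]^*\}$ |
| Power spectral density | $R_x(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_x(l)\, e^{-j\omega l}$ |
| Cross-power spectral density | $R_{xy}(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_{xy}(l)\, e^{-j\omega l}$ |
| Magnitude square coherence | $\lvert G_{xy}(e^{j\omega})\rvert^2 = \lvert R_{xy}(e^{j\omega})\rvert^2 / [R_x(e^{j\omega})\, R_y(e^{j\omega})]$ |

Interrelations

| Relation |
|---|
| $\gamma_x(l) = r_x(l) - \lvert\mu_x\rvert^2$ |
| $\gamma_{xy}(l) = r_{xy}(l) - \mu_x \mu_y^*$ |

Properties

| Autocorrelation | Auto-PSD |
|---|---|
| $r_x(l)$ is nonnegative definite | $R_x(e^{j\omega}) \ge 0$ and real |
| $r_x(l) = r_x^*(-l)$ | $R_x(e^{j\omega}) = R_x(e^{-j\omega})$ [real $x(n)$] |
| $\lvert r_x(l)\rvert \le r_x(0)$ | $R_x(z) = R_x^*(1/z^*)$ |
| $\lvert\rho_x(l)\rvert \le 1$ | $R_x(z) = R_x(z^{-1})$ [real $x(n)$] |

| Cross-correlation | Cross-PSD |
|---|---|
| $r_{xy}(l) = r_{yx}^*(-l)$ | $R_{xy}(z) = R_{yx}^*(1/z^*)$ |
| $\lvert r_{xy}(l)\rvert \le [r_x(0)\, r_y(0)]^{1/2} \le \tfrac{1}{2}[r_x(0) + r_y(0)]$ | $0 \le \lvert G_{xy}(e^{j\omega})\rvert \le 1$ |
| $\lvert\rho_{xy}(l)\rvert \le 1$ | |
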

3.4 LINEAR SYSTEMS WITH STATIONARY RANDOM INPUTS 


This section deals with the processing of stationary random sequences using linear, time- 
invariant (LTI) systems. We focus on expressing the second-order statistical properties of 
the output in terms of the corresponding properties of the input and the characteristics of 
the system. 


3.4.1 Time-Domain Analysis 


The first question to ask when we apply a random signal to a system is, Just what is the 
meaning of such an operation? We ask this because a random process is not just a single 
sequence but an ensemble of sequences (see Section 3.3). However, since each realization 
of the stochastic process is a deterministic signal, it is an acceptable input producing an 
output that is clearly a single realization of the output stochastic process. For an LTI system, 
each pair of input-output realizations is described by the convolution summation 

$$y(n, \zeta) = \sum_{k=-\infty}^{\infty} h(k)\, x(n-k, \zeta) \tag{3.4.1}$$

If the sum in the right side of (3.4.1) exists for all $\zeta$ in a set $A$ with $\Pr\{A\} = 1$, then we say that we
have almost-everywhere convergence or convergence with probability 1 (Papoulis 1991).
The existence of such convergence is ruled by the following theorem (Brockwell and Davis 
1991). 


THEOREM 3.2. If the process $x(n, \zeta)$ is stationary with $E\{|x(n, \zeta)|\} < \infty$ and if the system
is BIBO-stable, that is, $\sum_{k=-\infty}^{\infty} |h(k)| < \infty$, then the output $y(n, \zeta)$ of the system in (3.4.1)
converges absolutely with probability 1, or

$$y(n, \zeta) = \sum_{k=-\infty}^{\infty} h(k)\, x(n-k, \zeta) \qquad \text{for all } \zeta \in A,\ \Pr\{A\} = 1 \tag{3.4.2}$$

and is stationary. Furthermore, if $E\{|x(n, \zeta)|^2\} < \infty$, then $E\{|y(n, \zeta)|^2\} < \infty$ and $y(n, \zeta)$
converges in the mean square to the same limit and is stationary.


A less restrictive condition of finite energy on the system impulse response h(n) also 
guarantees the mean square existence of the output process, as stated in the following 
theorem. 


THEOREM 3.3. If the process $x(n, \zeta)$ is zero-mean and stationary with $\sum_{l=-\infty}^{\infty} |r_x(l)| < \infty$,
and if the system (3.4.1) satisfies the condition

$$\sum_{n=-\infty}^{\infty} |h(n)|^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} |H(e^{j\omega})|^2\, d\omega < \infty \tag{3.4.3}$$


then the output y(n, ¢) converges in the mean square sense and is stationary. 


The above two theorems are applicable when input processes have finite variances.
However, IID sequences with $\alpha$-stable distributions, $\alpha < 2$, have infinite variances. If the impulse
response of the system in (3.4.1) decays fast enough, then the following theorem (Brockwell
and Davis 1991) guarantees the absolute convergence of $y(n, \zeta)$ with probability 1. These
issues are of particular importance for inputs with high variability, which are discussed in
Section 3.3.5.


THEOREM 3.4. Let $x(n, \zeta)$ be an IID sequence of random variables with $\alpha$-stable distribution,
$0 < \alpha < 2$. If the impulse response $h(n)$ satisfies

$$\sum_{n=-\infty}^{\infty} |h(n)|^{\delta} < \infty \qquad \text{for some } \delta \in (0, \alpha)$$


then the output y(n, ¢) in (3.4.1) converges absolutely with probability 1. 



Clearly, a complete description of the output stochastic process $y(n)$ requires the
computation of an infinite number of convolutions. Thus, a better alternative would be to
determine the statistical properties of $y(n)$ in terms of the statistical properties of the input
and the characteristics of the system. For Gaussian signals, which are used very often in
practice, first- and second-order statistics are sufficient.


Output mean value. If $x(n)$ is stationary, its first-order statistic is determined by its
mean value $\mu_x$. To determine the mean value of the output, we take the expected value of
both sides of (3.4.1):

$$\mu_y = \sum_{k=-\infty}^{\infty} h(k)\, E\{x(n-k)\} = \mu_x \sum_{k=-\infty}^{\infty} h(k) = \mu_x H(e^{j0}) \tag{3.4.4}$$

Since $\mu_x$ and $H(e^{j0})$ are constant, $\mu_y$ is also constant. Note that $H(e^{j0})$ is the dc gain of
the system.


Input-output cross-correlation. If we take the complex conjugate of (3.4.1), premultiply
it by $x(n + l)$, and take the expectation of both sides, we have

$$E\{x(n+l)\, y^*(n)\} = \sum_{k=-\infty}^{\infty} h^*(k)\, E\{x(n+l)\, x^*(n-k)\}$$

or

$$r_{xy}(l) = \sum_{k=-\infty}^{\infty} h^*(k)\, r_x(l+k) = \sum_{m=-\infty}^{\infty} h^*(-m)\, r_x(l-m)$$

Hence,

$$r_{xy}(l) = h^*(-l) * r_x(l) \tag{3.4.5}$$

Similarly,

$$r_{yx}(l) = h(l) * r_x(l) \tag{3.4.6}$$


Output autocorrelation. Postmultiplying both sides of (3.4.1) by $y^*(n - l)$ and taking
the expectation, we obtain

$$E\{y(n)\, y^*(n-l)\} = \sum_{k=-\infty}^{\infty} h(k)\, E\{x(n-k)\, y^*(n-l)\} \tag{3.4.7}$$

or

$$r_y(l) = \sum_{k=-\infty}^{\infty} h(k)\, r_{xy}(l-k) = h(l) * r_{xy}(l) \tag{3.4.8}$$

From (3.4.5) and (3.4.8) we get

$$r_y(l) = h(l) * h^*(-l) * r_x(l) \tag{3.4.9}$$

or

$$r_y(l) = r_h(l) * r_x(l) \tag{3.4.10}$$

where

$$r_h(l) \triangleq h(l) * h^*(-l) = \sum_{n=-\infty}^{\infty} h(n)\, h^*(n-l) \tag{3.4.11}$$


is the autocorrelation of the impulse response and is called the system correlation sequence. 

Since $\mu_y$ is constant and $r_y(l)$ depends only on the lag $l$, the response of a stable
system to a stationary input is also a stationary process. A careful examination of (3.4.10)
shows that when a signal $x(n)$ is filtered by an LTI system with impulse response $h(n)$, its
autocorrelation is “filtered” by a system with impulse response equal to the autocorrelation
of its impulse response, as shown in Figure 3.10.

FIGURE 3.10
An equivalent LTI system for autocorrelation filtration. 
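
As an informal check of (3.4.10), the sketch below passes a zero-mean white Gaussian record through a short FIR filter and compares the sample autocorrelation of the output with $\sigma_w^2 r_h(l)$, which is what (3.4.10) gives when $r_x(l) = \sigma_w^2\delta(l)$. The filter coefficients, record length, seed, and helper names are assumptions made for this example.

```python
import random

def system_correlation(h, lag):
    """r_h(l) = sum_n h(n) h(n - l) for a real FIR impulse response h."""
    lag = abs(lag)  # r_h(-l) = r_h(l) when h is real
    return sum(h[n] * h[n - lag] for n in range(lag, len(h)))

def fir_filter(h, x):
    """y(n) = sum_k h(k) x(n - k), taking x(n) = 0 for n < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def sample_autocorrelation(x, lag):
    """Time-average estimate of E{x(n) x(n - l)} for a real record, lag >= 0."""
    return sum(x[n] * x[n - lag] for n in range(lag, len(x))) / (len(x) - lag)

rng = random.Random(11)
sigma2 = 1.0
h = [1.0, 0.5, -0.25]
w = [rng.gauss(0.0, sigma2 ** 0.5) for _ in range(60000)]
y = fir_filter(h, w)

predicted = [sigma2 * system_correlation(h, lag) for lag in range(4)]
estimated = [sample_autocorrelation(y, lag) for lag in range(4)]
for lag in range(4):
    print(f"lag {lag}: predicted = {predicted[lag]:+.4f}  estimated = {estimated[lag]:+.4f}")
```

For this filter the predicted values are 1.3125, 0.375, -0.25, and 0, so the output is correlated only over the length of the impulse response.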


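Output power. The power $E\{|y(n)|^2\}$ of the output process $y(n)$ is equal to $r_y(0)$,
which from (3.4.9) and (3.4.10) and the symmetry property $r_x(-l) = r_x^*(l)$ is

$$\begin{aligned}
P_y = r_y(0) &= r_h(l) * r_x(l)\big|_{l=0} \\
&= \sum_{k=-\infty}^{\infty} r_h(k)\, r_x(-k) = \sum_{k=-\infty}^{\infty} [h(k) * h^*(-k)]\, r_x^*(k) \\
&= \sum_{k=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} h(m)\, h^*(m-k)\, r_x^*(k) \qquad (3.4.12) \\
&= \sum_{k=-\infty}^{\infty} r_h(k)\, r_x^*(k) \qquad (3.4.13)
\end{aligned}$$

where the conjugate can be dropped for real-valued sequences. For FIR filters with
$\mathbf{h} = [h(0)\ h(1)\ \cdots\ h(M-1)]^T$, (3.4.12) can be written as

$$P_y = \mathbf{h}^H \mathbf{R}_x \mathbf{h} \tag{3.4.14}$$

Finally, we note that when $\mu_x = 0$, we have $\mu_y = 0$ and $\sigma_y^2 = P_y$.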


Output probability density function. Finding the probability density of the output of
an LTI system is very difficult, except in some special cases. For example, if $x(n)$ is a Gaussian
process, then the output is also a Gaussian process with mean and autocorrelation given by
(3.4.4) and (3.4.10). Also, if $x(n)$ is IID, the probability density of the output is obtained by
noting that $y(n)$ is a weighted sum of independent random variables. Indeed, the probability
density of the sum of independent random variables is the convolution of their probability
densities, and its characteristic function is the product of their characteristic functions. Thus,
if the input process is an IID stable process, then the output process is also stable, and its
probability density can be computed by using characteristic functions.


3.4.2 Frequency-Domain Analysis 


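To obtain the output autoPSD and complex autoPSD, we recall that if $H(z) = \mathcal{Z}\{h(n)\}$,
then

$$\mathcal{Z}\{h^*(-n)\} = H^*\!\left(\frac{1}{z^*}\right) \tag{3.4.15}$$

which reduces to $H(z^{-1})$ for real $h(n)$. From (3.4.5), (3.4.6), and (3.4.9) we obtain

$$R_{xy}(z) = H^*\!\left(\frac{1}{z^*}\right) R_x(z) \tag{3.4.16}$$

$$R_{yx}(z) = H(z)\, R_x(z) \tag{3.4.17}$$

and

$$R_y(z) = H(z)\, H^*\!\left(\frac{1}{z^*}\right) R_x(z) \tag{3.4.18}$$

For a stable system, the unit circle $z = e^{j\omega}$ lies within the ROCs of $H(z)$ and $H^*(1/z^*)$.
Thus,

$$R_{xy}(e^{j\omega}) = H^*(e^{j\omega})\, R_x(e^{j\omega}) \tag{3.4.19}$$

$$R_{yx}(e^{j\omega}) = H(e^{j\omega})\, R_x(e^{j\omega}) \tag{3.4.20}$$

and

$$R_y(e^{j\omega}) = H(e^{j\omega})\, H^*(e^{j\omega})\, R_x(e^{j\omega}) \tag{3.4.21}$$

or

$$R_y(e^{j\omega}) = |H(e^{j\omega})|^2 R_x(e^{j\omega}) \tag{3.4.22}$$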


Thus, if we know the input and output autocorrelations or autospectral densities, we can 
determine the magnitude response of a system, but not its phase response. Only cross- 
correlation or cross-spectral densities can provide phase information [see (3.4.19) and 
(3.4.20)]. 
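
The remark about magnitude and phase can be illustrated for an assumed real FIR filter driven by white noise of variance $\sigma_w^2$, so that $R_x(e^{j\omega}) = \sigma_w^2$. The sketch computes $R_y$ as the DTFT of $r_y(l) = \sigma_w^2 r_h(l)$ and compares it with $|H|^2\sigma_w^2$ from (3.4.22); the cross-PSD of (3.4.20) is formed directly. The coefficients, the test frequency, and the names are assumptions.

```python
import cmath

def frequency_response(h, omega):
    """H(e^{jw}) = sum_n h(n) e^{-jwn} for an FIR impulse response h."""
    return sum(h[n] * cmath.exp(-1j * omega * n) for n in range(len(h)))

def output_psd_from_autocorrelation(h, sigma2, omega):
    """DTFT of r_y(l) = sigma2 * r_h(l), valid for a white input and real h."""
    M = len(h)
    total = 0.0 + 0.0j
    for lag in range(-(M - 1), M):
        r_h = sum(h[n] * h[n - abs(lag)] for n in range(abs(lag), M))
        total += sigma2 * r_h * cmath.exp(-1j * omega * lag)
    return total

h = [1.0, 0.5, -0.25]
sigma2 = 2.0
omega = 0.7

H = frequency_response(h, omega)
R_y = output_psd_from_autocorrelation(h, sigma2, omega)
R_yx = H * sigma2  # (3.4.20) with R_x = sigma2: keeps the phase of H

print(f"R_y from autocorrelation: {R_y.real:.6f} (imaginary part {R_y.imag:.1e})")
print(f"|H|^2 * sigma2:           {abs(H) ** 2 * sigma2:.6f}")
print(f"phase of H: {cmath.phase(H):+.6f}  phase of R_yx: {cmath.phase(R_yx):+.6f}")
```

The output autoPSD is real, so it carries $|H|$ only, while the cross-PSD has the same phase as $H$.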

It can easily be shown that the power of the output is 


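$$\begin{aligned}
E\{|y(n)|^2\} = r_y(0) &= \frac{1}{2\pi} \int_{-\pi}^{\pi} |H(e^{j\omega})|^2 R_x(e^{j\omega})\, d\omega \qquad (3.4.23) \\
&= \sum_{l=-\infty}^{\infty} r_x(l)\, r_h^*(l) \qquad (3.4.24)
\end{aligned}$$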


which is equivalent to (3.4.13). 
Consider now a narrowband filter with frequency response 


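$$H(e^{j\omega}) = \begin{cases}
1 & \omega_c - \dfrac{\Delta\omega}{2} \le \omega \le \omega_c + \dfrac{\Delta\omega}{2} \\
0 & \text{elsewhere}
\end{cases} \tag{3.4.25}$$

The power of the filter output is

$$E\{|y(n)|^2\} = \frac{1}{2\pi} \int_{\omega_c - \Delta\omega/2}^{\omega_c + \Delta\omega/2} R_x(e^{j\omega})\, d\omega \approx \frac{\Delta\omega}{2\pi}\, R_x(e^{j\omega_c}) \tag{3.4.26}$$

assuming that $\Delta\omega$ is sufficiently small and that $R_x(e^{j\omega})$ is continuous at $\omega = \omega_c$. Since
$E\{|y(n)|^2\} \ge 0$, $R_x(e^{j\omega_c})$ is also nonnegative for all $\omega_c$ and $\Delta\omega$, hence

$$R_x(e^{j\omega}) \ge 0 \qquad -\pi < \omega \le \pi \tag{3.4.27}$$

Hence, the PSD $R_x(e^{j\omega})$ is nonnegative definite for any random sequence $x(n)$, real or
complex. Furthermore, $R_x(e^{j\omega})\, d\omega/(2\pi)$ has the interpretation of power, or $R_x(e^{j\omega})$ is a
power density as a function of frequency (in radians per sample). Table 3.2 shows various
input-output relationships in both the time and frequency domains.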


TABLE 3.2 
Second-order moments of stationary random sequences processed by linear, 
time-invariant systems. 


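| Time domain | Frequency domain | z domain |
|---|---|---|
| $y(n) = h(n) * x(n)$ | Not available | Not available |
| $r_{yx}(l) = h(l) * r_x(l)$ | $R_{yx}(e^{j\omega}) = H(e^{j\omega})\, R_x(e^{j\omega})$ | $R_{yx}(z) = H(z)\, R_x(z)$ |
| $r_{xy}(l) = h^*(-l) * r_x(l)$ | $R_{xy}(e^{j\omega}) = H^*(e^{j\omega})\, R_x(e^{j\omega})$ | $R_{xy}(z) = H^*(1/z^*)\, R_x(z)$ |
| $r_y(l) = h(l) * r_{xy}(l)$ | $R_y(e^{j\omega}) = H(e^{j\omega})\, R_{xy}(e^{j\omega})$ | $R_y(z) = H(z)\, R_{xy}(z)$ |
| $r_y(l) = h(l) * h^*(-l) * r_x(l)$ | $R_y(e^{j\omega}) = \lvert H(e^{j\omega})\rvert^2 R_x(e^{j\omega})$ | $R_y(z) = H(z)\, H^*(1/z^*)\, R_x(z)$ |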


3.4.3 Random Signal Memory 


Given the “zero-memory” process $w(n) \sim \text{IID}(0, \sigma_w^2)$, we can introduce dependence by
passing it through an LTI system. The extent and degree of the imposed dependence are
dictated by the shape of the system’s impulse response. The probability density of $w(n)$ is
not explicitly involved. Suppose now that we are given the resulting linear process $x(n)$,
and we want to quantify its memory. For processes with finite variance we can use the
correlation length

$$L_c = \frac{1}{r_x(0)} \sum_{l=0}^{\infty} r_x(l) = \sum_{l=0}^{\infty} \rho_x(l)$$

which equals the area under the normalized autocorrelation sequence curve and shows the
maximum distance at which two samples are significantly correlated.

An IID process has no memory and is completely described by its first-order density. 
A linear process has memory introduced by the impulse response of the generating system. 
If $w(n)$ has finite variance, the memory of the process is determined by the autocorrelation
of the impulse response because $r_x(l) = \sigma_w^2 r_h(l)$. Also, the higher-order densities of the
process are nonzero. Thus, the variability of the output—that is, what amplitudes the sig- 
nal takes, how often, and how fast the amplitude changes from sample to sample—is the 
combined effect of the input probability density and the system memory. 


DEFINITION 3.9. A stationary process $x(n)$ with finite variance is said to have long memory if
there exist constants $\alpha$, $0 < \alpha < 1$, and $C_r > 0$ such that

$$\lim_{l \to \infty} \frac{1}{C_r \sigma_x^2}\, r_x(l)\, l^{\alpha} = 1$$

This implies that the autocorrelation has fat or heavy tails, that is, asymptotically decays as
a power law

$$\rho_x(l) \simeq C_r |l|^{-\alpha} \qquad \text{as } l \to \infty$$

and slowly enough that

$$\sum_{l=-\infty}^{\infty} \rho_x(l) = \infty$$

that is, a long-memory process has infinite correlation length. If

$$\sum_{l=-\infty}^{\infty} \rho_x(l) < \infty$$

we say that the process has short memory. This is the case for autocorrelations that
decay exponentially, for example, $\rho_x(l) = a^{|l|}$, $-1 < a < 1$.
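
The difference between the two cases shows up in the partial sums of $\rho_x(l)$ over $-L \le l \le L$. The sketch below compares the exponential model $\rho_x(l) = a^{|l|}$ with a hypothetical power-law model $\rho_x(l) = (1 + |l|)^{-\alpha}$; the latter, and the values $a = 0.9$ and $\alpha = 0.5$, are assumptions chosen only to illustrate slow decay.

```python
def partial_sum(rho, max_lag):
    """Sum of rho(l) over lags -max_lag..max_lag for an even sequence rho."""
    return rho(0) + 2.0 * sum(rho(l) for l in range(1, max_lag + 1))

def short_memory(l, a=0.9):
    """Exponentially decaying normalized autocorrelation a^|l|."""
    return a ** abs(l)

def long_memory(l, alpha=0.5):
    """Hypothetical power-law normalized autocorrelation (1 + |l|)^(-alpha)."""
    return (1.0 + abs(l)) ** (-alpha)

for max_lag in (10, 1000, 100000):
    s = partial_sum(short_memory, max_lag)
    p = partial_sum(long_memory, max_lag)
    print(f"L = {max_lag:6d}: exponential sum = {s:8.3f}  power-law sum = {p:10.3f}")
```

The exponential partial sums level off at $(1 + a)/(1 - a) = 19$, whereas the power-law partial sums keep growing roughly like $\sqrt{L}$ for $\alpha = 0.5$.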

An equivalent definition of long memory can be formulated in terms of the power 
spectrum (Beran 1994; Samorodnitsky and Taqqu 1994). 


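DEFINITION 3.10. A stationary process $x(n)$ with finite variance is said to have long memory if
there exist constants $\beta$, $0 < \beta < 1$, and $C_R > 0$ such that

$$\lim_{\omega \to 0} \frac{1}{C_R \sigma_x^2}\, R_x(e^{j\omega})\, |\omega|^{\beta} = 1$$

This asymptotic definition implies that

$$R_x(e^{j\omega}) \simeq \frac{C_R \sigma_x^2}{|\omega|^{\beta}} \qquad \text{as } \omega \to 0$$

and

$$R_x(e^{j0}) = \sum_{l=-\infty}^{\infty} r_x(l) = \infty$$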

The first-order density determines the mean value and the variance of a process, whereas 
the second-order density determines the autocorrelation and power spectrum. There is a 
coupling between the probability density and the autocorrelation or power spectrum of a
process. However, this coupling is not extremely strong because there are processes that
have different densities and the same autocorrelation. Thus, we can have random signal 
models with short or long memory and low or high variability. Random signal models are 
discussed in Chapters 4 and 12. 


3.4.4 General Correlation Matrices 


We begin with the properties of general correlation matrices. Similar properties apply
to covariance matrices.


PROPERTY 3.4.1. The correlation matrix of a random vector x is conjugate symmetric or Her- 
mitian, that is, 


$$\mathbf{R}_x = \mathbf{R}_x^H \tag{3.4.28}$$

Proof. This follows easily from (3.2.19). 


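PROPERTY 3.4.2. The correlation matrix of a random vector $\mathbf{x}$ is nonnegative definite (n.n.d.);
or for every nonzero complex vector $\mathbf{w} = [w_1\ w_2\ \cdots\ w_M]^T$, the quadratic form $\mathbf{w}^H \mathbf{R}_x \mathbf{w}$ is
nonnegative, that is,

$$\mathbf{w}^H \mathbf{R}_x \mathbf{w} \ge 0 \tag{3.4.29}$$

Proof. To prove (3.4.29), we define the dot product

$$\alpha \triangleq \mathbf{w}^H \mathbf{x} = \mathbf{x}^T \mathbf{w}^* = \sum_{k=1}^{M} w_k^* x_k \tag{3.4.30}$$

The mean square value of the random variable $\alpha$ is

$$E\{|\alpha|^2\} = E\{\mathbf{w}^H \mathbf{x}\mathbf{x}^H \mathbf{w}\} = \mathbf{w}^H E\{\mathbf{x}\mathbf{x}^H\}\, \mathbf{w} = \mathbf{w}^H \mathbf{R}_x \mathbf{w} \tag{3.4.31}$$

Since $E\{|\alpha|^2\} \ge 0$, it follows that $\mathbf{w}^H \mathbf{R}_x \mathbf{w} \ge 0$. We also note that a matrix is called positive
definite (p.d.) if $\mathbf{w}^H \mathbf{R}_x \mathbf{w} > 0$ for every nonzero $\mathbf{w}$.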
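
A quick numerical illustration of Property 3.4.2, assuming the correlation sequence $r_x(l) = a^{|l|}$ of Example 3.3.4 arranged in a $4 \times 4$ Toeplitz matrix with entries $r_x(i - j)$: the quadratic form is evaluated for many random complex vectors. The matrix size, $a = 0.7$, the seed, and the helper names are assumptions.

```python
import random

def toeplitz_correlation_matrix(a, M):
    """M x M matrix with entries r_x(i - j) = a^|i - j| (real-valued example)."""
    return [[a ** abs(i - j) for j in range(M)] for i in range(M)]

def quadratic_form(R, w):
    """w^H R w for a complex vector w and a square matrix R."""
    M = len(w)
    return sum(w[i].conjugate() * R[i][j] * w[j]
               for i in range(M) for j in range(M))

rng = random.Random(5)
R = toeplitz_correlation_matrix(0.7, 4)

values = []
for _ in range(1000):
    w = [complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(4)]
    values.append(quadratic_form(R, w))

print("smallest real part:     ", min(v.real for v in values))
print("largest |imaginary part|:", max(abs(v.imag) for v in values))
```

Because the matrix is Hermitian, the quadratic form is real up to rounding, and for this example every value should be strictly positive.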

Eigenvalues and eigenvectors of R


For a Hermitian matrix $\mathbf{R}$ we wish to find an $M \times 1$ vector $\mathbf{q}$ that satisfies the condition

$$\mathbf{R}\mathbf{q} = \lambda \mathbf{q} \tag{3.4.32}$$

where $\lambda$ is a constant. This condition implies that the linear transformation performed
by matrix $\mathbf{R}$ does not change the direction of vector $\mathbf{q}$. Thus $\mathbf{R}\mathbf{q}$ is a direction-invariant
mapping. To determine the vector $\mathbf{q}$, we write (3.4.32) as

$$(\mathbf{R} - \lambda \mathbf{I})\, \mathbf{q} = \mathbf{0} \tag{3.4.33}$$

where $\mathbf{I}$ is the $M \times M$ identity matrix and $\mathbf{0}$ is an $M \times 1$ vector of zeros. Since we seek a
nonzero vector $\mathbf{q}$, (3.4.33) can be satisfied only if the determinant of $\mathbf{R} - \lambda \mathbf{I}$ equals zero, that is,

$$\det(\mathbf{R} - \lambda \mathbf{I}) = 0 \tag{3.4.34}$$

This equation is an $M$th-order polynomial in $\lambda$ and is called the characteristic equation of
$\mathbf{R}$. It has $M$ roots $\{\lambda_i\}_1^M$, called eigenvalues, which, in general, are distinct. If (3.4.34) has
repeated roots, then $\mathbf{R}$ is said to have degenerate eigenvalues. For each eigenvalue $\lambda_i$ we
can satisfy (3.4.32)

$$\mathbf{R}\mathbf{q}_i = \lambda_i \mathbf{q}_i \qquad i = 1, \ldots, M \tag{3.4.35}$$

where the $\mathbf{q}_i$ are called eigenvectors of $\mathbf{R}$. Therefore, the $M \times M$ matrix $\mathbf{R}$ has $M$
eigenvectors. To uniquely determine $\mathbf{q}_i$, we use (3.4.35) along with the normality condition that
$\|\mathbf{q}_i\| = 1$. A MATLAB function [Q,Lambda] = eig(R) is available to compute eigenvalues
and eigenvectors of $\mathbf{R}$; the first output holds the eigenvectors as columns and the second is
the diagonal matrix of eigenvalues.


There are further properties of the autocorrelation matrix $\mathbf{R}$ based on its eigenanalysis,
which we describe below. Consider a matrix $\mathbf{R}$ that is Hermitian and nonnegative definite,
with eigenvalues $\{\lambda_i\}_1^M$ and eigenvectors $\{\mathbf{q}_i\}_1^M$.

PROPERTY 3.4.3. The matrix $\mathbf{R}^k$ ($k = 1, 2, \ldots$) has eigenvalues $\lambda_1^k, \lambda_2^k, \ldots, \lambda_M^k$.

Proof. See Problem 3.16.


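PROPERTY 3.4.4. If the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$ are distinct, the corresponding eigenvectors
$\{\mathbf{q}_i\}_1^M$ are linearly independent.


Proof. This property can be proved by using Property 3.4.3. If there exist $M$ not-all-zero
scalars $\{\alpha_i\}_1^M$ such that

$$\sum_{i=1}^{M} \alpha_i \mathbf{q}_i = \mathbf{0} \tag{3.4.36}$$

then the eigenvectors $\{\mathbf{q}_i\}_1^M$ are said to be linearly dependent. Assume that (3.4.36) is true for
some not-all-zero scalars $\{\alpha_i\}_1^M$ and that the eigenvalues $\{\lambda_i\}_1^M$ are distinct. Now multiply
(3.4.36) repeatedly by $\mathbf{R}^k$, $k = 0, \ldots, M-1$, and use Property 3.4.3 to obtain

$$\sum_{i=1}^{M} \alpha_i \mathbf{R}^k \mathbf{q}_i = \sum_{i=1}^{M} \alpha_i \lambda_i^k \mathbf{q}_i = \mathbf{0} \qquad k = 0, \ldots, M-1 \tag{3.4.37}$$

which can be arranged in matrix format as

$$\begin{bmatrix} \alpha_1 \mathbf{q}_1 & \alpha_2 \mathbf{q}_2 & \alpha_3 \mathbf{q}_3 & \cdots & \alpha_M \mathbf{q}_M \end{bmatrix}
\begin{bmatrix}
1 & \lambda_1 & \lambda_1^2 & \cdots & \lambda_1^{M-1} \\
1 & \lambda_2 & \lambda_2^2 & \cdots & \lambda_2^{M-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \lambda_M & \lambda_M^2 & \cdots & \lambda_M^{M-1}
\end{bmatrix} = \mathbf{0} \tag{3.4.38}$$

Since all the $\lambda_i$ are distinct, the matrix containing the $\lambda_i$ in (3.4.38) above is nonsingular. This
matrix is called a Vandermonde matrix. Therefore, postmultiplying both sides of (3.4.38) by the
inverse of the Vandermonde matrix, we obtain

$$\begin{bmatrix} \alpha_1 \mathbf{q}_1 & \alpha_2 \mathbf{q}_2 & \alpha_3 \mathbf{q}_3 & \cdots & \alpha_M \mathbf{q}_M \end{bmatrix} = \mathbf{0} \tag{3.4.39}$$

Since the eigenvectors $\{\mathbf{q}_i\}_1^M$ are not zero vectors, the only way (3.4.39) can be satisfied is if all
$\{\alpha_i\}_1^M$ are zero. This implies that (3.4.36) cannot be satisfied for any set of not-all-zero scalars
$\{\alpha_i\}_1^M$, which further implies that $\{\mathbf{q}_i\}_1^M$ are linearly independent.
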
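PROPERTY 3.4.5. The eigenvalues $\{\lambda_i\}_1^M$ are real and nonnegative.

Proof. From (3.4.35), we have

$$\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i = \lambda_i \mathbf{q}_i^H \mathbf{q}_i \qquad i = 1, 2, \ldots, M \tag{3.4.40}$$

Since $\mathbf{R}$ is positive semidefinite, the quadratic form $\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i \ge 0$. Also, since $\mathbf{q}_i^H \mathbf{q}_i$ is an inner
product, $\mathbf{q}_i^H \mathbf{q}_i > 0$. Hence

$$\lambda_i = \frac{\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i}{\mathbf{q}_i^H \mathbf{q}_i} \ge 0 \qquad i = 1, 2, \ldots, M \tag{3.4.41}$$

Furthermore, if $\mathbf{R}$ is positive definite, then $\lambda_i > 0$ for all $1 \le i \le M$. The quotient in (3.4.41) is
a useful quantity and is known as the Rayleigh quotient of vector $\mathbf{q}_i$.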


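PROPERTY 3.4.6. If the eigenvalues $\{\lambda_i\}_1^M$ are distinct, then the corresponding eigenvectors
are orthogonal to one another, that is,

$$\lambda_i \ne \lambda_j \;\Rightarrow\; \mathbf{q}_i^H \mathbf{q}_j = 0 \qquad \text{for } i \ne j \tag{3.4.42}$$

Proof. Consider (3.4.35). We have

$$\mathbf{R}\mathbf{q}_i = \lambda_i \mathbf{q}_i \tag{3.4.43}$$

and

$$\mathbf{R}\mathbf{q}_j = \lambda_j \mathbf{q}_j \tag{3.4.44}$$

for some $i \ne j$. Premultiplying both sides of (3.4.43) by $\mathbf{q}_j^H$, we obtain

$$\mathbf{q}_j^H \mathbf{R} \mathbf{q}_i = \mathbf{q}_j^H \lambda_i \mathbf{q}_i = \lambda_i \mathbf{q}_j^H \mathbf{q}_i \tag{3.4.45}$$

Taking the conjugate transpose of (3.4.44), using the Hermitian property (3.4.28) of $\mathbf{R}$, and using
the realness Property 3.4.5 of eigenvalues, we get

$$\mathbf{q}_j^H \mathbf{R} = \lambda_j \mathbf{q}_j^H \tag{3.4.46}$$

Now postmultiplying (3.4.46) by $\mathbf{q}_i$ and comparing with (3.4.45), we conclude that

$$\lambda_i \mathbf{q}_j^H \mathbf{q}_i = \lambda_j \mathbf{q}_j^H \mathbf{q}_i \qquad \text{or} \qquad (\lambda_i - \lambda_j)\,\mathbf{q}_j^H \mathbf{q}_i = 0 \tag{3.4.47}$$

Since the eigenvalues are assumed to be distinct, the only way (3.4.47) can be satisfied is if
$\mathbf{q}_j^H \mathbf{q}_i = 0$ for $i \ne j$, which further proves that the corresponding eigenvectors are orthogonal
to one another.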


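PROPERTY 3.4.7. Let $\{\mathbf{q}_i\}_1^M$ be an orthonormal set of eigenvectors corresponding to the distinct
eigenvalues $\{\lambda_i\}_1^M$ of an $M \times M$ correlation matrix $\mathbf{R}$. Then $\mathbf{R}$ can be diagonalized as follows:

$$\boldsymbol{\Lambda} = \mathbf{Q}^H \mathbf{R} \mathbf{Q} \tag{3.4.48}$$

where the orthonormal matrix $\mathbf{Q} \triangleq [\mathbf{q}_1 \; \cdots \; \mathbf{q}_M]$ is known as an eigenmatrix and $\boldsymbol{\Lambda}$ is an $M \times M$
diagonal eigenvalue matrix, that is,

$$\boldsymbol{\Lambda} \triangleq \mathrm{diag}(\lambda_1, \ldots, \lambda_M) \tag{3.4.49}$$

Proof. Arranging the vectors in (3.4.35) in a matrix format, we obtain

$$[\mathbf{R}\mathbf{q}_1 \;\; \mathbf{R}\mathbf{q}_2 \;\; \cdots \;\; \mathbf{R}\mathbf{q}_M] = [\lambda_1\mathbf{q}_1 \;\; \lambda_2\mathbf{q}_2 \;\; \cdots \;\; \lambda_M\mathbf{q}_M]$$

which, by using the definitions of $\mathbf{Q}$ and $\boldsymbol{\Lambda}$, can be further expressed as

$$\mathbf{R}\mathbf{Q} = \mathbf{Q}\boldsymbol{\Lambda} \tag{3.4.50}$$

Since $\mathbf{q}_i$, $i = 1, \ldots, M$, is an orthonormal set of vectors, the eigenmatrix $\mathbf{Q}$ is unitary, that is,
$\mathbf{Q}^{-1} = \mathbf{Q}^H$. Now premultiplying both sides of (3.4.50) by $\mathbf{Q}^H$, we obtain the desired result.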


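This diagonalization of the autocorrelation matrix plays an important role in filtering
and estimation theory, as we shall see later. From (3.4.48) the correlation matrix $\mathbf{R}$ can also
be written as

$$\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H = \lambda_1 \mathbf{q}_1\mathbf{q}_1^H + \cdots + \lambda_M \mathbf{q}_M\mathbf{q}_M^H = \sum_{m=1}^{M} \lambda_m \mathbf{q}_m \mathbf{q}_m^H \tag{3.4.51}$$

which is known as the spectral theorem, or Mercer's theorem. If $\mathbf{R}$ is positive definite (and
hence invertible), its inverse is given by

$$\mathbf{R}^{-1} = (\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H)^{-1} = \mathbf{Q}\boldsymbol{\Lambda}^{-1}\mathbf{Q}^H = \sum_{m=1}^{M} \frac{1}{\lambda_m} \mathbf{q}_m \mathbf{q}_m^H \tag{3.4.52}$$

because $\boldsymbol{\Lambda}$ is a diagonal matrix.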
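The expansions (3.4.51) and (3.4.52) can be checked numerically. The following is a minimal Python sketch, assuming a real symmetric $2 \times 2$ matrix so that the eigenpairs have a closed form; the helper names `eig_sym_2x2` and `outer_sum` and the test matrix are illustrative choices, not part of the text.

```python
import math

def eig_sym_2x2(r):
    """Eigenvalues and unit eigenvectors of a real symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = r[0][0], r[0][1], r[1][1]
    mean, half_diff = (a + d) / 2.0, (a - d) / 2.0
    radius = math.hypot(half_diff, b)
    lams = [mean + radius, mean - radius]
    # Rotation angle of the eigenvector that belongs to the larger eigenvalue.
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    q1 = [math.cos(theta), math.sin(theta)]
    q2 = [-math.sin(theta), math.cos(theta)]
    return lams, [q1, q2]

def outer_sum(lams, qs, power=1):
    """Return sum_m lam_m**power * q_m q_m^T as a 2x2 nested list."""
    return [[sum((lam ** power) * q[i] * q[j] for lam, q in zip(lams, qs))
             for j in range(2)] for i in range(2)]

R = [[2.0, 0.6], [0.6, 1.0]]              # assumed example matrix
lams, qs = eig_sym_2x2(R)
R_rebuilt = outer_sum(lams, qs)           # spectral expansion (3.4.51)
R_inv = outer_sum(lams, qs, power=-1)     # inverse expansion (3.4.52)
print(lams, R_rebuilt, R_inv)
```

The same two calls also illustrate Properties 3.4.8 and 3.4.9: `sum(lams)` equals the trace of `R` and the product of the two eigenvalues equals its determinant.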


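PROPERTY 3.4.8. The trace of $\mathbf{R}$ is the summation of all eigenvalues, that is,

$$\mathrm{tr}(\mathbf{R}) = \sum_{i=1}^{M} \lambda_i \tag{3.4.53}$$

Proof. See Problem 3.17.

PROPERTY 3.4.9. The determinant of $\mathbf{R}$ is equal to the product of all eigenvalues, that is,

$$\det \mathbf{R} = |\mathbf{R}| = \prod_{i=1}^{M} \lambda_i = |\boldsymbol{\Lambda}| \tag{3.4.54}$$

Proof. See Problem 3.18.

PROPERTY 3.4.10. Determinants of $\mathbf{R}$ and $\boldsymbol{\Gamma}$ are related by

$$|\mathbf{R}| = |\boldsymbol{\Gamma}|\,(1 + \boldsymbol{\mu}^H \boldsymbol{\Gamma}^{-1} \boldsymbol{\mu}) \tag{3.4.55}$$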


Proof. See Problem 3.19. 


3.4.5 Correlation Matrices from Random Processes 


A stochastic process can also be represented as a random vector, and its second-order
statistics are then given by the mean vector and the correlation matrix. Obviously, these quantities
are functions of the index $n$. Let an $M \times 1$ random vector $\mathbf{x}(n)$ be derived from the random
process $x(n)$ as follows:

$$\mathbf{x}(n) \triangleq [x(n) \;\; x(n-1) \;\; \cdots \;\; x(n-M+1)]^T \tag{3.4.56}$$

Then its mean is given by an $M \times 1$ vector

$$\boldsymbol{\mu}_x(n) = [\mu_x(n) \;\; \mu_x(n-1) \;\; \cdots \;\; \mu_x(n-M+1)]^T \tag{3.4.57}$$

and the correlation by an $M \times M$ matrix

$$\mathbf{R}_x(n) = \begin{bmatrix}
r_x(n, n) & \cdots & r_x(n, n-M+1) \\
\vdots & \ddots & \vdots \\
r_x(n-M+1, n) & \cdots & r_x(n-M+1, n-M+1)
\end{bmatrix} \tag{3.4.58}$$

Clearly, $\mathbf{R}_x(n)$ is Hermitian since $r_x(n-i, n-j) = r_x^*(n-j, n-i)$, $0 \le i, j \le M-1$.
This vector representation will be useful when we discuss optimum filters.


Correlation matrices of stationary processes 


The correlation matrix $\mathbf{R}_x(n)$ of a general stochastic process $x(n)$ is a Hermitian $M \times M$
matrix defined in (3.4.58) with elements $r_x(n-i, n-j) = E\{x(n-i)x^*(n-j)\}$. For
stationary processes this matrix has an interesting additional structure. First, $\mathbf{R}_x(n)$ is a
constant matrix $\mathbf{R}_x$; then using (3.3.24), we have

$$r_x(n-i, n-j) = r_x(j-i) \triangleq r_x(l \triangleq j-i) \tag{3.4.59}$$

Finally, by using conjugate symmetry $r_x(l) = r_x^*(-l)$, the matrix $\mathbf{R}_x$ is given by

$$\mathbf{R}_x = \begin{bmatrix}
r_x(0) & r_x(1) & r_x(2) & \cdots & r_x(M-1) \\
r_x^*(1) & r_x(0) & r_x(1) & \cdots & r_x(M-2) \\
r_x^*(2) & r_x^*(1) & r_x(0) & \cdots & r_x(M-3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r_x^*(M-1) & r_x^*(M-2) & r_x^*(M-3) & \cdots & r_x(0)
\end{bmatrix} \tag{3.4.60}$$

It can be easily seen that $\mathbf{R}_x$ is Hermitian and Toeplitz.† Thus, the autocorrelation matrix
of a stationary process is Hermitian, nonnegative definite, and Toeplitz. Note that $\mathbf{R}_x$ is not
persymmetric because elements along the main antidiagonal are not equal, in general.

†A matrix is called Toeplitz if the elements along each diagonal, parallel to the main diagonal, are equal.




Eigenvalue spread and spectral dynamic range 


The ill conditioning of a matrix $\mathbf{R}_x$ increases with its condition number $\mathcal{X}(\mathbf{R}_x) =
\lambda_{\max}/\lambda_{\min}$. When $\mathbf{R}_x$ is a correlation matrix of a stationary process, then $\mathcal{X}(\mathbf{R}_x)$ is bounded
from above by the dynamic range of the PSD $R_x(e^{j\omega})$ of the process $x(n)$. The larger the
spread in eigenvalues, the wider (or less flat) the variation of the PSD function. This is also
related to the dynamic range or to the data spread in $x(n)$ and is a useful measure in practice.
This result is given by the following theorem, in which we have dropped the subscript of
$R_x(e^{j\omega})$ for clarity.


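THEOREM 3.5. Consider a zero-mean stationary random process with autoPSD

$$R(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r(l)\, e^{-j\omega l}$$

then

$$\min_{\omega} R(e^{j\omega}) \le \lambda_i \le \max_{\omega} R(e^{j\omega}) \qquad \text{for all } i = 1, 2, \ldots, M \tag{3.4.61}$$

Proof. From (3.4.41) we have

$$\lambda_i = \frac{\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i}{\mathbf{q}_i^H \mathbf{q}_i} \tag{3.4.62}$$

Consider the quadratic form

$$\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i = \sum_{k=1}^{M} \sum_{l=1}^{M} q_i^*(k)\, r(l-k)\, q_i(l)$$

where $\mathbf{q}_i = [q_i(1) \;\; q_i(2) \;\; \cdots \;\; q_i(M)]^T$. Using (3.3.41) and the stationarity of the process, we
obtain

$$\begin{aligned}
\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i &= \frac{1}{2\pi} \sum_{k=1}^{M} \sum_{l=1}^{M} q_i^*(k)\, q_i(l) \int_{-\pi}^{\pi} R(e^{j\omega})\, e^{j\omega(l-k)}\, d\omega \\
&= \frac{1}{2\pi} \int_{-\pi}^{\pi} R(e^{j\omega}) \left[ \sum_{k=1}^{M} q_i^*(k)\, e^{-j\omega k} \right] \left[ \sum_{l=1}^{M} q_i(l)\, e^{j\omega l} \right] d\omega
\end{aligned} \tag{3.4.63}$$

or

$$\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(e^{j\omega})\, |Q(e^{j\omega})|^2\, d\omega \tag{3.4.64}$$

where $|Q(e^{j\omega})|^2$ denotes the product of the two bracketed sums in (3.4.63), which are complex
conjugates of each other. Similarly, we have

$$\mathbf{q}_i^H \mathbf{q}_i = \frac{1}{2\pi} \int_{-\pi}^{\pi} |Q(e^{j\omega})|^2\, d\omega \tag{3.4.65}$$

Substituting (3.4.64) and (3.4.65) in (3.4.62), we obtain

$$\lambda_i = \frac{\int_{-\pi}^{\pi} |Q(e^{j\omega})|^2\, R(e^{j\omega})\, d\omega}{\int_{-\pi}^{\pi} |Q(e^{j\omega})|^2\, d\omega} \tag{3.4.66}$$

However, since $R(e^{j\omega}) \ge 0$, we have the following inequality:

$$\min_{\omega} R(e^{j\omega}) \int_{-\pi}^{\pi} |Q(e^{j\omega})|^2\, d\omega \;\le\; \int_{-\pi}^{\pi} |Q(e^{j\omega})|^2\, R(e^{j\omega})\, d\omega \;\le\; \max_{\omega} R(e^{j\omega}) \int_{-\pi}^{\pi} |Q(e^{j\omega})|^2\, d\omega$$

from which we easily obtain the desired result. The above result also implies that

$$\mathcal{X}(\mathbf{R}) \triangleq \frac{\lambda_{\max}}{\lambda_{\min}} \le \frac{\max_{\omega} R(e^{j\omega})}{\min_{\omega} R(e^{j\omega})} \tag{3.4.67}$$

which becomes equality as $M \to \infty$.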
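The bound of Theorem 3.5 can be illustrated numerically through the Rayleigh quotient (3.4.62), which always lies between the smallest and largest eigenvalue. The Python sketch below assumes the real autocorrelation $r(l) = a^{|l|}$, whose PSD is $(1-a^2)/(1 - 2a\cos\omega + a^2)$ with extremes $(1-a)/(1+a)$ and $(1+a)/(1-a)$ for $0 < a < 1$; the function names and parameter values are illustrative assumptions.

```python
import random

def toeplitz_corr(r):
    """Symmetric Toeplitz matrix (3.4.60) from real autocorrelation lags r[0..M-1]."""
    M = len(r)
    return [[r[abs(i - j)] for j in range(M)] for i in range(M)]

def rayleigh_quotient(R, q):
    """q^T R q / q^T q for a real matrix R and real vector q, as in (3.4.41)."""
    n = len(q)
    Rq = [sum(R[i][j] * q[j] for j in range(n)) for i in range(n)]
    return sum(qi * v for qi, v in zip(q, Rq)) / sum(qi * qi for qi in q)

a, M = 0.8, 8                       # assumed example: r(l) = a^|l|
R = toeplitz_corr([a ** l for l in range(M)])
psd_min = (1 - a) / (1 + a)         # PSD value at omega = pi
psd_max = (1 + a) / (1 - a)         # PSD value at omega = 0

rng = random.Random(0)
quotients = [rayleigh_quotient(R, [rng.gauss(0.0, 1.0) for _ in range(M)])
             for _ in range(1000)]
print(psd_min, min(quotients), max(quotients), psd_max)
```

Every quotient falls inside the interval spanned by the PSD extremes, which is exactly the chain of inequalities used in the proof. Random vectors rarely hit the extreme eigenvectors, so the observed range is narrower than the eigenvalue range itself.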


3.5 WHITENING AND INNOVATIONS REPRESENTATION 


In many practical and theoretical applications, it is desirable to represent a random vector 
(or sequence) with a linearly equivalent vector (or sequence) consisting of uncorrelated 
components. If x is a correlated random vector and if A is a nonsingular matrix, then the 
linear transformation 


w = Ax (3.5.1) 


results in a random vector $\mathbf{w}$ that contains the same “information” as $\mathbf{x}$, and hence random
vectors $\mathbf{x}$ and $\mathbf{w}$ are said to be linearly equivalent. Furthermore, if $\mathbf{w}$ has uncorrelated
components and $\mathbf{A}$ is lower-triangular, then each component $w_i$ of $\mathbf{w}$ can be thought of as adding
“new” information (or innovation) to $\mathbf{w}$ that is not present in the remaining components.
Such a representation is called an innovations representation and provides additional in- 
sight into the understanding of random vectors and sequences. Additionally, it can simplify 
many theoretical derivations and can result in computationally efficient implementations. 

Since $\boldsymbol{\Gamma}_w$ must be a diagonal matrix, we need to diagonalize the Hermitian, positive
definite matrix $\boldsymbol{\Gamma}_x$ through the transformation matrix $\mathbf{A}$. There are two approaches to this
diagonalization. One approach is to use the eigenanalysis presented in Section 3.4.4, which 
results in the well-known Karhunen-Loéve (KL) transform. The other approach is to use 
triangularization methods from linear algebra, which leads to the LDU (UDL) and LU (UL) 
decompositions. These vector techniques can be further extended to random sequences that 
give us the KL expansion and the spectral factorizations, respectively. 


3.5.1 Transformations Using Eigendecomposition 


Let $\mathbf{x}$ be a random vector with mean vector $\boldsymbol{\mu}_x$ and covariance matrix $\boldsymbol{\Gamma}_x$. The linear
transformation

$$\mathbf{x}_0 = \mathbf{x} - \boldsymbol{\mu}_x \tag{3.5.2}$$

results in a zero-mean vector $\mathbf{x}_0$ with correlation (and covariance) matrix equal to $\boldsymbol{\Gamma}_x$. This
transformation shifts the origin of the $M$-dimensional coordinate system to the mean vector.
We will now consider the zero-mean random vector $\mathbf{x}_0$ for further transformations.


Orthonormal transformation 


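Let $\mathbf{Q}_x$ be the eigenmatrix of $\boldsymbol{\Gamma}_x$, and let us choose $\mathbf{Q}_x^H$ as our linear transformation
matrix $\mathbf{A}$ in (3.2.32). Consider

$$\mathbf{w} = \mathbf{Q}_x^H \mathbf{x}_0 = \mathbf{Q}_x^H (\mathbf{x} - \boldsymbol{\mu}_x) \tag{3.5.3}$$

Then

$$\boldsymbol{\mu}_w = \mathbf{Q}_x^H (E\{\mathbf{x}_0\}) = \mathbf{0} \tag{3.5.4}$$

and from (3.2.39) and (3.4.48)

$$\boldsymbol{\Gamma}_w = \mathbf{R}_w = E\{\mathbf{Q}_x^H \mathbf{x}_0 \mathbf{x}_0^H \mathbf{Q}_x\} = \mathbf{Q}_x^H \boldsymbol{\Gamma}_x \mathbf{Q}_x = \boldsymbol{\Lambda}_x \tag{3.5.5}$$

Since $\boldsymbol{\Lambda}_x$ is diagonal, $\boldsymbol{\Gamma}_w$ is also diagonal, and hence this transformation has some interesting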
properties: 


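1. The random vector $\mathbf{w}$ has zero mean, and its components are mutually uncorrelated (and
hence orthogonal). Furthermore, if $\mathbf{x}$ is $\mathcal{N}(\boldsymbol{\mu}_x, \boldsymbol{\Gamma}_x)$, then $\mathbf{w}$ is $\mathcal{N}(\mathbf{0}, \boldsymbol{\Lambda}_x)$ with independent
components.

2. The variances of random variables $w_i$, $i = 1, \ldots, M$, are equal to the eigenvalues of
$\boldsymbol{\Gamma}_x$.

3. Since the transformation matrix $\mathbf{A} = \mathbf{Q}_x^H$ is orthonormal, the transformation is called
an orthonormal transformation and the distance measure

$$d^2(\mathbf{x}_0) \triangleq \mathbf{x}_0^H \boldsymbol{\Gamma}_x^{-1} \mathbf{x}_0 \tag{3.5.6}$$

is preserved under the transformation. This distance measure is also known as the
Mahalanobis distance; and in the case of normal random vectors, it is related to the
log-likelihood function.

4. Since $\mathbf{w} = \mathbf{Q}_x^H (\mathbf{x} - \boldsymbol{\mu}_x)$, we have

$$w_i = \mathbf{q}_i^H (\mathbf{x} - \boldsymbol{\mu}_x) = \|\mathbf{x} - \boldsymbol{\mu}_x\| \cos[\angle(\mathbf{x} - \boldsymbol{\mu}_x, \mathbf{q}_i)] \qquad i = 1, \ldots, M \tag{3.5.7}$$

which is the projection of $\mathbf{x} - \boldsymbol{\mu}_x$ onto the unit vector $\mathbf{q}_i$. Thus $\mathbf{w}$ represents $\mathbf{x}$ in a new
coordinate system that is shifted to $\boldsymbol{\mu}_x$ and spanned by $\mathbf{q}_i$, $i = 1, \ldots, M$. A geometric
interpretation of this transformation for a two-dimensional case is shown in Figure 3.11,
which shows a contour of $d^2(\mathbf{x}_0) = \mathbf{x}_0^H \boldsymbol{\Gamma}_x^{-1} \mathbf{x}_0 = \mathbf{w}^H \boldsymbol{\Lambda}_x^{-1} \mathbf{w}$ in the $\mathbf{x}$ and $\mathbf{w}$ coordinate
systems ($\mathbf{w} = \mathbf{Q}_x^H \mathbf{x}_0$).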


FIGURE 3.11
Orthogonal transformation in two dimensions.


Isotropic transformation 


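In the above orthonormal transformation, the autocorrelation matrix $\mathbf{R}_w$ is diagonal
but not an identity matrix $\mathbf{I}$. An identity correlation matrix can be obtained by an additional
linear mapping of $\boldsymbol{\Lambda}_x^{-1/2}$. Let

$$\mathbf{y} = \boldsymbol{\Lambda}_x^{-1/2} \mathbf{w} = \boldsymbol{\Lambda}_x^{-1/2} \mathbf{Q}_x^H \mathbf{x}_0 = \boldsymbol{\Lambda}_x^{-1/2} \mathbf{Q}_x^H (\mathbf{x} - \boldsymbol{\mu}_x) \tag{3.5.8}$$

Then

$$\mathbf{R}_y = \boldsymbol{\Lambda}_x^{-1/2} \mathbf{Q}_x^H \boldsymbol{\Gamma}_x \mathbf{Q}_x \boldsymbol{\Lambda}_x^{-1/2} = \boldsymbol{\Lambda}_x^{-1/2} \boldsymbol{\Lambda}_x \boldsymbol{\Lambda}_x^{-1/2} = \mathbf{I} \tag{3.5.9}$$

This is called an isotropic transformation because all components of $\mathbf{y}$ are zero-mean,
uncorrelated random variables with unit variance.† The geometric interpretation of this
transformation for a two-dimensional case is shown in Figure 3.12. It clearly shows that there
is not only a shift and rotation but also a scaling of the coordinate axis so that the distribution
is equal in all directions, that is, it is direction-invariant. Because the transformation $\mathbf{A} =
\boldsymbol{\Lambda}_x^{-1/2} \mathbf{Q}_x^H$ is orthogonal but not orthonormal, the distance measure $d^2(\mathbf{x}_0)$ is not preserved
under this mapping. Since the correlation matrix after this transformation is an identity
matrix $\mathbf{I}$, it is invariant under any orthonormal mapping, that is,

$$\mathbf{Q}^H \mathbf{I} \mathbf{Q} = \mathbf{Q}^H \mathbf{Q} = \mathbf{I} \tag{3.5.10}$$

This fact can be used for simultaneous diagonalization of two Hermitian matrices.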


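FIGURE 3.12
Isotropic transformation in two dimensions. [Figure: original distribution and isotropic distribution.]

†In the literature, an isotropic transformation is also known as a whitening transformation. We believe that this
terminology is not accurate because both vectors $\mathbf{Q}_x^H \mathbf{x}_0$ and $\boldsymbol{\Lambda}_x^{-1/2} \mathbf{Q}_x^H \mathbf{x}_0$ have uncorrelated coefficients.

EXAMPLE 3.5.1. Consider a stationary sequence with correlation matrix

$$\mathbf{R}_x = \begin{bmatrix} 1 & a \\ a & 1 \end{bmatrix}$$

where $-1 < a < 1$. The eigenvalues

$$\lambda_1 = 1 + a \qquad \lambda_2 = 1 - a$$

are obtained from the characteristic equation

$$\det(\mathbf{R}_x - \lambda \mathbf{I}) = \det \begin{bmatrix} 1-\lambda & a \\ a & 1-\lambda \end{bmatrix} = (1-\lambda)^2 - a^2 = 0$$

To find the eigenvector $\mathbf{q}_1$, we solve the linear system

$$\begin{bmatrix} 1 & a \\ a & 1 \end{bmatrix} \begin{bmatrix} q_1^{(1)} \\ q_2^{(1)} \end{bmatrix} = (1+a) \begin{bmatrix} q_1^{(1)} \\ q_2^{(1)} \end{bmatrix}$$

which gives $q_1^{(1)} = q_2^{(1)}$. Similarly, we find that $q_1^{(2)} = -q_2^{(2)}$. If we normalize both vectors to
unit length, we obtain the eigenvectors

$$\mathbf{q}_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \qquad \mathbf{q}_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

From the above results we see that $\det \mathbf{R}_x = 1 - a^2 = \lambda_1 \lambda_2$ and $\mathbf{Q}^H \mathbf{Q} = \mathbf{I}$, where $\mathbf{Q} = [\mathbf{q}_1 \;\; \mathbf{q}_2]$.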
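The orthonormal and isotropic transformations of Section 3.5.1 can be tried on the matrix of Example 3.5.1. The following Python sketch assumes the specific value $a = 0.5$ and uses small hand-written matrix helpers (`matmul`, `transpose`), both of which are illustrative choices rather than anything prescribed by the text.

```python
import math

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

a = 0.5                                   # assumed value with -1 < a < 1
R = [[1.0, a], [a, 1.0]]
s = 1.0 / math.sqrt(2.0)
Q = [[s, s], [s, -s]]                     # columns are q_1 and q_2
lams = [1.0 + a, 1.0 - a]

# Orthonormal transformation: R_w = Q^T R Q should equal diag(lams), see (3.5.5).
R_w = matmul(transpose(Q), matmul(R, Q))

# Isotropic transformation: A = Lambda^{-1/2} Q^T gives R_y = A R A^T = I, see (3.5.9).
Qt = transpose(Q)
A = [[Qt[i][j] / math.sqrt(lams[i]) for j in range(2)] for i in range(2)]
R_y = matmul(A, matmul(R, transpose(A)))
print(R_w, R_y)
```

Up to rounding, `R_w` is the diagonal matrix of eigenvalues and `R_y` is the identity, so the second mapping only rescales each rotated coordinate by $1/\sqrt{\lambda_i}$.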


3.5.2 Transformations Using Triangular Decomposition 


The linear transformations discussed above were based on diagonalization of Hermitian
matrices through eigenvalue-eigenvector decomposition. These are useful in many detection 
and estimation problems. Triangular matrix decomposition leads to transformations that 
result in causal or anticausal linear filtering of associated sequences. Hence these mappings 
play an important role in linear filtering. There are two such decompositions: the lower- 
diagonal-upper (LDU) one leads to causal filtering while the upper-diagonal-lower (UDL) 
one results in anticausal filtering. 


Lower-diagonal-upper decomposition 
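Any Hermitian, positive definite matrix $\mathbf{R}$ can be factored as (Golub and Van Loan
1989)

$$\mathbf{R} = \mathbf{L} \mathbf{D}_L \mathbf{L}^H \tag{3.5.11}$$

or equivalently

$$\mathbf{L}^{-1} \mathbf{R} \mathbf{L}^{-H} = \mathbf{D}_L \tag{3.5.12}$$

where $\mathbf{L}$ is a unit lower triangular matrix, $\mathbf{D}_L$ is a diagonal matrix with positive elements,
and $\mathbf{L}^H$ is a unit upper triangular matrix. The MATLAB function [L,D] = ldlt(R), given in
Section 5.2, computes the LDU decomposition.

Since $\mathbf{L}$ is unit lower triangular, we have $\det \mathbf{R} = \prod_{i=1}^{M} \xi_i^l$, where $\xi_1^l, \ldots, \xi_M^l$ are the
diagonal elements of $\mathbf{D}_L$. If we define the linear transformation

$$\mathbf{w} = \mathbf{L}^{-1} \mathbf{x} \triangleq \mathbf{B} \mathbf{x} \tag{3.5.13}$$

we find that

$$\mathbf{R}_w = E\{\mathbf{w}\mathbf{w}^H\} = \mathbf{L}^{-1} E\{\mathbf{x}\mathbf{x}^H\} \mathbf{L}^{-H} = \mathbf{L}^{-1} \mathbf{R} \mathbf{L}^{-H} = \mathbf{D}_L \tag{3.5.14}$$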
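Clearly, the components of $\mathbf{w}$ are orthogonal, and the elements $\xi_1^l, \ldots, \xi_M^l$ are their second
moments. Therefore, this transformation appears to be similar to the orthogonal one. However,
the vector $\mathbf{w}$ is not obtained as a simple rotation of $\mathbf{x}$. To understand this mapping, we
first note that $\mathbf{B} = \mathbf{L}^{-1}$ is also a unit lower triangular matrix (Golub and Van Loan 1989).
Then we can write (3.5.13) as

$$\begin{bmatrix} w_1 \\ \vdots \\ w_i \\ \vdots \\ w_M \end{bmatrix} =
\begin{bmatrix}
1 & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots \\
b_{i1} & \cdots & 1 & \cdots & 0 \\
\vdots & & \vdots & \ddots & \vdots \\
b_{M1} & \cdots & b_{Mi} & \cdots & 1
\end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \\ x_i \\ \vdots \\ x_M \end{bmatrix} \tag{3.5.15}$$

where $b_{ik}$ are elements of $\mathbf{B}$. From (3.5.15) we conclude that $w_i$ is a linear combination of
$x_k$, $k \le i$, that is,

$$w_i = \sum_{k=1}^{i} b_{ik} x_k \qquad 1 \le i \le M \tag{3.5.16}$$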


If the signal vector $\mathbf{x}$ consists of consecutive samples of a discrete-time stochastic process
$x(n)$, that is,

$$\mathbf{x} = [x(n) \;\; x(n-1) \;\; \cdots \;\; x(n-M+1)]^T \tag{3.5.17}$$


then (3.5.16) can be interpreted as a causal linear filtering of the random sequence (see 
Chapter 2). This transformation will be used extensively in optimum linear filtering and 
prediction problems. 

A similar LDU decomposition of autocovariance matrices can be performed by following
the identical steps above. In this case, the components of the transformed vector $\mathbf{w}$ are
uncorrelated, and the elements $\xi_i^l$, $1 \le i \le M$, of $\mathbf{D}_L$ are variances.
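For readers who want to experiment with the factorization (3.5.11), the following is a minimal Python sketch restricted to real symmetric positive definite matrices. It is not the ldlt function of Section 5.2; the names `ldl_decompose` and `forward_solve_unit` and the test matrix are hypothetical.

```python
def ldl_decompose(R):
    """Factor a real symmetric positive definite R as L D L^T.

    Returns (L, d): L is unit lower triangular (list of rows) and d holds
    the diagonal elements of D.
    """
    M = len(R)
    L = [[1.0 if i == j else 0.0 for j in range(M)] for i in range(M)]
    d = [0.0] * M
    for j in range(M):
        d[j] = R[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, M):
            L[i][j] = (R[i][j] - sum(L[i][k] * L[j][k] * d[k] for k in range(j))) / d[j]
    return L, d

def forward_solve_unit(L, x):
    """Solve L w = x for unit lower triangular L, i.e. w = L^{-1} x as in (3.5.13)."""
    w = []
    for i in range(len(x)):
        w.append(x[i] - sum(L[i][k] * w[k] for k in range(i)))
    return w

R = [[4.0, 2.0, 1.0],
     [2.0, 3.0, 0.5],
     [1.0, 0.5, 2.0]]                      # assumed example matrix
L, d = ldl_decompose(R)
R_rebuilt = [[sum(L[i][k] * d[k] * L[j][k] for k in range(3)) for j in range(3)]
             for i in range(3)]
w = forward_solve_unit(L, [4.0, 2.0, 1.0])
print(d, w)
```

Because `forward_solve_unit` computes each `w[i]` from `x[0..i]` only, it mirrors the triangular structure of (3.5.16). The product of the entries of `d` equals the determinant of `R`.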


Upper-diagonal-lower decomposition 


This diagonalization is almost identical to the previous one and involves factorization
of a Hermitian, positive definite matrix into an upper-diagonal-lower form. It is given by

$$\mathbf{R} = \mathbf{U} \mathbf{D}_U \mathbf{U}^H \tag{3.5.18}$$

or equivalently

$$\mathbf{U}^{-1} \mathbf{R} \mathbf{U}^{-H} = \mathbf{D}_U = \mathrm{diag}(\xi_1^u, \ldots, \xi_M^u) \tag{3.5.19}$$

in which the matrix $\mathbf{U}$ is unit upper triangular, the matrix $\mathbf{U}^H$ is unit lower triangular, and the
matrix $\mathbf{D}_U$ is diagonal with positive elements. Note that $\mathbf{U}^H \ne \mathbf{L}$ and $\mathbf{D}_U \ne \mathbf{D}_L$. Following
the same analysis as above, we have $\det \mathbf{R} = \det \mathbf{D}_U = \prod_{i=1}^{M} \xi_i^u$. Since $\mathbf{A} = \mathbf{U}^{-1}$ is unit
upper triangular in the transformation $\mathbf{w} = \mathbf{U}^{-1}\mathbf{x}$, the components of $\mathbf{w}$ are orthogonal and
are obtained by linear combinations of $x_k$, $k \ge i$, that is,

$$w_i = \sum_{k=i}^{M} l_{ik} x_k \qquad 1 \le i \le M \tag{3.5.20}$$

where $l_{ik}$ are the elements of $\mathbf{U}^{-1}$.
This represents an anticausal filtering of a random sequence if $\mathbf{x}$ is a signal vector. Table 3.3
compares and contrasts orthogonal and triangular decompositions. We note that the LDU
decomposition does not have the nice geometric interpretation (rotation of the coordinate
system) of the eigendecomposition transformation.


Generation of real-valued random vectors with given second-order moments. Sup- 
pose that we want to generate M samples, say, x1,x2,...,xm, of a real-valued random 
vector x with mean 0 and a given symmetric and positive definite autocorrelation matrix Rx. 


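TABLE 3.3
Comparison of orthogonal and triangular decompositions
for zero-mean random vectors.

Orthogonal decomposition                            Triangular decomposition
--------------------------------------------------  ------------------------------------
R = E{x x^H}                                        R = E{x x^H}
R q_i = λ_i q_i
Q = [q_1, q_2, ..., q_M]                            L = unit lower triangular
Λ = diag{λ_1, λ_2, ..., λ_M}                        D = diag{ξ_1, ξ_2, ..., ξ_M}
R = Q Λ Q^H = Σ_{i=1}^{M} λ_i q_i q_i^H             R = L D L^H
Λ = Q^H R Q                                         D = L^{-1} R L^{-H}
R^{-1} = Q Λ^{-1} Q^H = Σ_{i=1}^{M} (1/λ_i) q_i q_i^H   R^{-1} = L^{-H} D^{-1} L^{-1}
Λ^{-1} = Q^H R^{-1} Q                               D^{-1} = L^H R^{-1} L
det R = det Λ = Π_{i=1}^{M} λ_i                     det R = det D = Π_{i=1}^{M} ξ_i
tr R = tr Λ = Σ_{i=1}^{M} λ_i
Whitening (noncausal)                               Whitening (causal)
w = Q^H x                                           w = L^{-1} x
E{w w^H} = Λ                                        E{w w^H} = D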


The innovations representation given in this section suggests three approaches to generate
samples of such a random vector. The general approach is to factor $\mathbf{R}_x$, using either the
orthonormal or the triangularization transformation, to obtain the diagonal matrix ($\boldsymbol{\Lambda}_x$ or
$\mathbf{D}_L^{(x)}$ or $\mathbf{D}_U^{(x)}$), generate $M$ samples of an IID sequence with the obtained diagonal variances,
and then transform these samples by using the inverse transformation matrix ($\mathbf{Q}_x$ or $\mathbf{L}_x$ or
$\mathbf{U}_x$). We hasten to add that, in general, the original distribution of the IID samples will not be
preserved unless the samples are jointly normal. Therefore, in the following discussion, we
assume that a normal pseudorandom number generator is used to generate $M$ independent
samples of $\mathbf{w}$. The three methods are as follows.

Eigendecomposition approach. First factor $\mathbf{R}_x$ as $\mathbf{R}_x = \mathbf{Q}_x \boldsymbol{\Lambda}_x \mathbf{Q}_x^H$. Then generate
$\mathbf{w}$, using the distribution $\mathcal{N}(\mathbf{0}, \boldsymbol{\Lambda}_x)$. Finally, compute the desired vector $\mathbf{x}$, using
$\mathbf{x} = \mathbf{Q}_x \mathbf{w}$.

LDU triangularization approach. First factor $\mathbf{R}_x$ as $\mathbf{R}_x = \mathbf{L}_x \mathbf{D}_L^{(x)} \mathbf{L}_x^H$. Then generate
$\mathbf{w}$, using the distribution $\mathcal{N}(\mathbf{0}, \mathbf{D}_L^{(x)})$. Finally, compute the desired vector $\mathbf{x}$, using
$\mathbf{x} = \mathbf{L}_x \mathbf{w}$.†

UDL triangularization approach. First factor $\mathbf{R}_x$ as $\mathbf{R}_x = \mathbf{U}_x \mathbf{D}_U^{(x)} \mathbf{U}_x^H$. Then generate
$\mathbf{w}$, using the distribution $\mathcal{N}(\mathbf{0}, \mathbf{D}_U^{(x)})$. Finally, compute the desired vector $\mathbf{x}$, using
$\mathbf{x} = \mathbf{U}_x \mathbf{w}$.


Additional discussion and more complete treatment on the generation of random vectors 
are given in Johnson (1994). 
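As a concrete illustration of the triangularization route, the following Python sketch draws zero-mean Gaussian vectors with a prescribed real correlation matrix by means of a Cholesky factor, which is the scaled form of the LDU approach. The function names, the seed, the sample count, and the $2 \times 2$ target matrix are all assumptions made for the example.

```python
import math
import random

def cholesky(R):
    """Lower triangular C with C C^T = R for a real symmetric positive definite R."""
    M = len(R)
    C = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i + 1):
            s = R[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(s) if i == j else s / C[j][j]
    return C

def generate_vectors(R, count, seed=0):
    """Draw `count` zero-mean Gaussian vectors whose correlation matrix is R."""
    rng = random.Random(seed)
    C, M = cholesky(R), len(R)
    samples = []
    for _ in range(count):
        w = [rng.gauss(0.0, 1.0) for _ in range(M)]        # w ~ N(0, I)
        samples.append([sum(C[i][k] * w[k] for k in range(i + 1)) for i in range(M)])
    return samples

R = [[1.0, 0.5], [0.5, 1.0]]                               # assumed target correlation
xs = generate_vectors(R, 20000)
R_hat = [[sum(x[i] * x[j] for x in xs) / len(xs) for j in range(2)] for i in range(2)]
print(R_hat)
```

The sample correlation `R_hat` approaches `R` as the number of generated vectors grows; with a finite sample it only matches to within statistical fluctuation.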


3.5.3 The Discrete Karhunen-Loéve Transform 


In many signal processing applications, it is convenient to represent the samples of a random
signal in another set of numbers (or coefficients) so that this new representation possesses
some useful properties. For example, for coding purposes we want to transform a signal
so that its energy is concentrated in only a few coefficients (which are then transmitted);
or for optimal filtering purposes we may want uncorrelated samples so that the filtering
complexity is reduced or the signal-to-noise ratio is enhanced. A general approach is to
expand a signal as a linear combination of orthogonal basis functions so that components
of the signal with respect to basis functions do not interfere with one another. There are
several such basis functions; the most widely known is the set of complex exponentials
used in DTFT (or DFT) that are used in linear filtering, as we discussed in Section 3.4.
Other examples are functions used in discrete cosine transform, discrete sine transform,
Haar transform, etc., which are useful in coding applications (Jain 1989).

†If we use the Cholesky decomposition $\mathbf{R}_x = \mathcal{L}_x \mathcal{L}_x^H$, where $\mathcal{L}_x = \mathbf{L}_x (\mathbf{D}_L^{(x)})^{1/2}$, then $\mathbf{w} = \mathcal{N}(\mathbf{0}, \mathbf{I})$ will generate
$\mathbf{x}$ with the given correlation $\mathbf{R}_x$, using $\mathbf{x} = \mathcal{L}_x \mathbf{w}$.

As discussed in this section, a set of orthogonal basis functions for which the signal 
components are statistically uncorrelated to one another is based on the second-order prop- 
erties of the random process and, in particular, on the diagonalization of its covariance 
matrix. It is also an optimal representation of the signal in the sense that it provides a repre- 
sentation with the smallest mean square error among all other orthogonal transforms. This 
has applications in the analysis of random signals as well as in coding. This transform was 
first suggested by Karhunen and Loéve for continuous random processes. It was extended to 
discrete random signals by Hotelling and is also known as the Hotelling transform. In keep- 
ing with the current nomenclature, we will call it the discrete Karhunen-Loéve transform 
(DKLT) (Fukunaga 1990). 


Development of the DKLT 
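Let $\mathbf{x} = [x_1 \;\; x_2 \;\; \cdots \;\; x_M]^T$ be a zero-mean† random vector with autocorrelation matrix
$\mathbf{R}_x$. We want to represent $\mathbf{x}$ using the linear transformation

$$\mathbf{w} = \mathbf{A}^H \mathbf{x} \qquad \mathbf{A}^{-1} = \mathbf{A}^H \tag{3.5.21}$$

where $\mathbf{A}$ is a unitary matrix. Then

$$\mathbf{x} = \mathbf{A}\mathbf{w} = \sum_{i=1}^{M} w_i \mathbf{a}_i \qquad \mathbf{a}_i^H \mathbf{a}_j = 0 \quad i \ne j \tag{3.5.22}$$

Let us represent $\mathbf{x}$ using the first $m$, $1 \le m \le M$, components of $\mathbf{w}$, that is,

$$\hat{\mathbf{x}} \triangleq \sum_{i=1}^{m} w_i \mathbf{a}_i \qquad 1 \le m \le M \tag{3.5.23}$$

Then from (3.5.22) and (3.5.23), the error between $\mathbf{x}$ and $\hat{\mathbf{x}}$ is given by

$$\mathbf{e}_m \triangleq \mathbf{x} - \hat{\mathbf{x}} = \sum_{i=1}^{M} w_i \mathbf{a}_i - \sum_{i=1}^{m} w_i \mathbf{a}_i = \sum_{i=m+1}^{M} w_i \mathbf{a}_i \tag{3.5.24}$$

and hence the mean-squared error (MSE) is

$$E_m \triangleq E\{\mathbf{e}_m^H \mathbf{e}_m\} = \sum_{i=m+1}^{M} \mathbf{a}_i^H E\{|w_i|^2\}\, \mathbf{a}_i = \sum_{i=m+1}^{M} E\{|w_i|^2\}\, \mathbf{a}_i^H \mathbf{a}_i \tag{3.5.25}$$

Since from (3.5.21) $w_i = \mathbf{a}_i^H \mathbf{x}$, we have $E\{|w_i|^2\} = \mathbf{a}_i^H \mathbf{R}_x \mathbf{a}_i$. Now we want to determine
the matrix $\mathbf{A}$ that will minimize the MSE $E_m$ subject to $\mathbf{a}_i^H \mathbf{a}_i = 1$, $i = m+1, \ldots, M$, so
that from (3.5.25)

$$E_m = \sum_{i=m+1}^{M} E\{|w_i|^2\} = \sum_{i=m+1}^{M} \mathbf{a}_i^H \mathbf{R}_x \mathbf{a}_i \qquad \mathbf{a}_i^H \mathbf{a}_i = 1 \quad i = m+1, \ldots, M \tag{3.5.26}$$

†If the mean is not zero, then we perform the transformation on the mean-subtracted vector, using the covariance
matrix.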


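This optimization can be done by using the Lagrange multiplier approach (Appendix B);
that is, we minimize

$$\sum_{i=m+1}^{M} \mathbf{a}_i^H \mathbf{R}_x \mathbf{a}_i + \sum_{i=m+1}^{M} \lambda_i (1 - \mathbf{a}_i^H \mathbf{a}_i) \qquad i = m+1, \ldots, M$$

Hence after setting the gradient equal to zero,

$$\nabla_{\mathbf{a}_i} \left[ \sum_{i=m+1}^{M} \mathbf{a}_i^H \mathbf{R}_x \mathbf{a}_i + \sum_{i=m+1}^{M} \lambda_i (1 - \mathbf{a}_i^H \mathbf{a}_i) \right] = (\mathbf{R}_x \mathbf{a}_i)^* - (\lambda_i \mathbf{a}_i)^* = \mathbf{0} \tag{3.5.27}$$

we obtain

$$\mathbf{R}_x \mathbf{a}_i = \lambda_i \mathbf{a}_i \qquad i = m+1, \ldots, M$$

which is equivalent to (3.4.35) in the eigenanalysis of Section 3.4.4. Hence $\lambda_i$ is the eigenvalue,
and the corresponding $\mathbf{a}_i$ is the eigenvector of $\mathbf{R}_x$. Clearly, since $1 \le m \le M$, the
transformation matrix $\mathbf{A}$ should be chosen as the eigenmatrix $\mathbf{Q}$. Hence

$$\mathbf{w} = \begin{bmatrix} \mathbf{q}_1^H \\ \mathbf{q}_2^H \\ \vdots \\ \mathbf{q}_M^H \end{bmatrix} \mathbf{x}$$

or more concisely

$$\mathbf{w} = \mathbf{Q}^H \mathbf{x} \tag{3.5.28}$$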


provides an orthonormal transformation so that the transformed vector $\mathbf{w}$ is a zero-mean,
uncorrelated random vector with autocorrelation $\boldsymbol{\Lambda}$. This transformation is called the DKLT,
and its inverse relationship (or synthesis) is given by

$$\mathbf{x} = \begin{bmatrix} \mathbf{q}_1 & \mathbf{q}_2 & \cdots & \mathbf{q}_M \end{bmatrix} \mathbf{w} \tag{3.5.29}$$

or

$$\mathbf{x} = \mathbf{Q}\mathbf{w} = \mathbf{q}_1 w_1 + \mathbf{q}_2 w_2 + \cdots + \mathbf{q}_M w_M \tag{3.5.30}$$

From Section 3.5.1, the geometric interpretation of this transformation is that $\{w_k\}_1^M$ are
projections of the vector $\mathbf{x}$ with respect to the rotated coordinate system of $\{\mathbf{q}_k\}_1^M$. The
eigenvalues $\lambda_i$ also have an interesting interpretation, as we shall see in the following
representation.


Optimal reduced-basis representation 


Generally we would expect any transformation to provide only a few meaningful components
so that we can use only those basis vectors resulting in a smaller representation
error. To determine this reduced-basis representation property of the DKLT, let us use the first
$K < M$ eigenvectors (instead of all $\mathbf{q}_i$). Then from (3.5.26), we have

$$E_K = \sum_{i=K+1}^{M} \lambda_i \tag{3.5.31}$$

In other words, the MSE in the reduced-basis representation, when the first $K$ basis vectors
are used, is the sum of the remaining eigenvalues (which are never negative). Therefore, to
obtain a minimum MSE (that is, an optimum) representation, the procedure is to choose $K$
eigenvectors corresponding to the $K$ largest eigenvalues.


Application in data compression. The DKLT is a transformation on a random vector
that produces a zero-mean, uncorrelated vector and that can minimize the mean square
representation error. One of its popular applications is data compression in communications
and, in particular, in speech and image coding. Suppose we want to send a sample function
of a speech process $x_c(t)$. If we sample this waveform and obtain $M$ samples $\{x(n)\}_0^{M-1}$,
then we need to send $M$ data values. Instead, if we analyze the correlation of $\{x(n)\}_0^{M-1}$
and determine that $M$ values can be approximated by a smaller number $K$ of coefficients $w_i$ and
the corresponding $\mathbf{q}_i$, then we can compute these $K$ data values $\{w_i\}_1^K$ at the transmitter
and send them to the receiver through the communication channel. At the receiver, we
can reconstruct $\{x(n)\}_0^{M-1}$ by using (3.5.23), as shown in Figure 3.13. Obviously, both
the transmitter and receiver must have the information about the eigenvectors $\{\mathbf{q}_i\}_1^M$. A
considerable amount of compression is achieved if $K$ is much smaller than $M$.


[Block diagram: uncoded signal $x(n)$ → DKLT → $w(n)$ → reduced-basis selection scheme → coded signal $\hat{w}(n)$ → inverse DKLT → reconstructed signal $\hat{x}(n)$.]

FIGURE 3.13
Signal coding scheme using the DKLT.


Periodic random sequences 


As we noted in the previous section, the correlation matrix of a stationary process is
Toeplitz. If the autocorrelation sequence of a random process is periodic with fundamental
period $M$, its correlation matrix becomes circulant. All rows (columns) of a circulant matrix
are obtained by circular rotation of its first row (column). Using (3.4.60) and the periodicity
relation $r_x(l) = r_x(l - M)$, we obtain

$$\mathbf{R}_x = \begin{bmatrix}
r_x(0) & r_x(1) & r_x(2) & \cdots & r_x(M-1) \\
r_x(M-1) & r_x(0) & r_x(1) & \cdots & r_x(M-2) \\
r_x(M-2) & r_x(M-1) & r_x(0) & \cdots & r_x(M-3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r_x(1) & r_x(2) & r_x(3) & \cdots & r_x(0)
\end{bmatrix} \tag{3.5.32}$$

which is a circulant matrix. We note that a circulant matrix is Toeplitz but not vice versa.
If we define the $M$-point DFT of the periodic sequence $r_x(l)$

$$\tilde{R}_x(k) = \sum_{l=0}^{M-1} r_x(l)\, W_M^{kl} \tag{3.5.33}$$

where $W_M \triangleq e^{-j2\pi/M}$, and the vector

$$\mathbf{w}_k \triangleq \frac{1}{\sqrt{M}} [1 \;\; W_M^{k} \;\; W_M^{2k} \;\; \cdots \;\; W_M^{(M-1)k}]^T \qquad 0 \le k \le M-1 \tag{3.5.34}$$

we can easily see that multiplying the first row of $\mathbf{R}_x$ by the vector $\mathbf{w}_k$ results in $\tilde{R}_x(k)/\sqrt{M}$.
Using $W_M^{-k} = W_M^{(M-1)k}$, we find that the product of the second row by $\mathbf{w}_k$ is equal to
$\tilde{R}_x(k) W_M^{k}/\sqrt{M}$. In general, the $i$th row by $\mathbf{w}_k$ gives $\tilde{R}_x(k) W_M^{(i-1)k}/\sqrt{M}$. Therefore, we
have

$$\mathbf{R}_x \mathbf{w}_k = \tilde{R}_x(k)\, \mathbf{w}_k \qquad 0 \le k \le M-1 \tag{3.5.35}$$

which shows that the normalized DFT vectors $\mathbf{w}_k$ are the eigenvectors of the circulant
matrix $\mathbf{R}_x$, with the DFT coefficients $\tilde{R}_x(k)$ as the corresponding eigenvalues. Therefore, the
DFT provides the DKLT of periodic random sequences. We recall that $\tilde{R}_x(k)$ are samples
of the DTFT $R_x(e^{j2\pi k/M})$ of the finite-length sequence $r_x(l)$, $0 \le l \le M-1$.


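If we define the $M \times M$ matrix

$$\mathbf{W} \triangleq [\mathbf{w}_0 \;\; \mathbf{w}_1 \;\; \cdots \;\; \mathbf{w}_{M-1}] \tag{3.5.36}$$

we can show that

$$\mathbf{W}^H \mathbf{W} = \mathbf{W}\mathbf{W}^H = \mathbf{I} \tag{3.5.37}$$

that is, the matrix $\mathbf{W}$ is unitary. The set of equations (3.5.35) can be written as

$$\mathbf{W}^H \mathbf{R}_x \mathbf{W} = \mathrm{diag}\{\tilde{R}_x(0), \tilde{R}_x(1), \ldots, \tilde{R}_x(M-1)\} \tag{3.5.38}$$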

which shows that the DFT performs the diagonalization of circulant matrices. Although 
there is no fast algorithm for the diagonalization of general Toeplitz matrices, in many 
cases we can use the DFT to approximate the DKLT of stationary random sequences. The 
approximation is adequate if the correlation becomes negligible for |/| > M, which is 
the case for many stationary processes. This explains the fact that the eigenvectors of a 
Toeplitz matrix resemble complex exponentials for large values of M. The DKLT also can 
be extended to handle the representation of random sequences. These issues are further 
explored in Therrien (1992), Gray (1972), and Fukunaga (1990). 
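The eigenrelation (3.5.35) is easy to verify numerically. The Python sketch below builds a small circulant matrix and checks that each normalized DFT vector is an eigenvector whose eigenvalue is the corresponding DFT coefficient; the period $M = 4$, the lag values, and the function names are assumptions chosen for illustration.

```python
import cmath

def circulant(r):
    """Circulant matrix (3.5.32): row i is the first row rotated right by i."""
    M = len(r)
    return [[r[(j - i) % M] for j in range(M)] for i in range(M)]

def dft(r):
    """M-point DFT (3.5.33) with W_M = exp(-j 2 pi / M)."""
    M = len(r)
    return [sum(r[l] * cmath.exp(-2j * cmath.pi * k * l / M) for l in range(M))
            for k in range(M)]

def dft_vector(k, M):
    """Normalized DFT vector w_k of (3.5.34)."""
    return [cmath.exp(-2j * cmath.pi * k * i / M) / M ** 0.5 for i in range(M)]

r = [1.0, 0.5, 0.2, 0.5]          # one period of an assumed even autocorrelation (M = 4)
R = circulant(r)
eigvals = dft(r)

residuals = []
for k in range(len(r)):
    w = dft_vector(k, len(r))
    Rw = [sum(R[i][j] * w[j] for j in range(len(r))) for i in range(len(r))]
    # Largest entry of R w_k - R~(k) w_k; it should vanish up to rounding.
    residuals.append(max(abs(Rw[i] - eigvals[k] * w[i]) for i in range(len(r))))
print(eigvals, residuals)
```

For this choice of lags the DFT coefficients are real and nonnegative, as expected for the eigenvalues of a valid correlation matrix.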


3.6 PRINCIPLES OF ESTIMATION THEORY 


The key assumption underlying our discussion up to this point was that the probability 
distributions associated with the problem under consideration were known. As a result, 
all required probabilities, autocorrelation sequences, and PSD functions either could be 
derived from a set of assumptions about the involved random processes or were given a 
priori. However, in most practical applications, this is the exception rather than the rule. 
Therefore, the properties and parameters of random variables and random processes should 
be obtained by collecting and analyzing finite sets of measurements. In this section, we 
introduce some basic concepts of estimation theory that will be used repeatedly in the rest 
of the book. Complete treatments of estimation theory can be found in Kay (1993), Helstrom 
(1995), Van Trees (1968), and Papoulis (1991). 


3.6.1 Properties of Estimators 


Suppose that we collect $N$ observations $\{x(n)\}_0^{N-1}$ from a stationary stochastic process and
use them to estimate a parameter $\theta$ (which we assume to be real-valued) of the process
using some function $\hat{\theta}[\{x(n)\}_0^{N-1}]$. The same results can be used for a set of measurements
$\{x_k(n)\}_1^N$ obtained from $N$ sensors sampling stochastic processes with the same distributions.
The function $\hat{\theta}[\{x(n)\}_0^{N-1}]$ is known as an estimator whereas the value taken by the
estimator, using a particular set of observations, is called a point estimate or simply an
estimate. The intention of the estimator design is that the estimate should be as close to the
true value of the parameter as possible. However, if we use another set of observations or a
different number of observations from the same set, it is highly unlikely that we will obtain
the same estimate. As an example of an estimator, consider estimating the mean $\mu_x$ of a
stationary process $x(n)$ from its $N$ observations $\{x(n)\}_0^{N-1}$. Then the natural estimator is a
simple arithmetic average of these observations, given by

$$\hat{\mu}_x = \hat{\theta}[\{x(n)\}_0^{N-1}] = \frac{1}{N} \sum_{n=0}^{N-1} x(n) \tag{3.6.1}$$

Similarly, a natural estimator of the variance $\sigma_x^2$ of the process $x(n)$ would be

$$\hat{\sigma}_x^2 = \hat{\theta}[\{x(n)\}_0^{N-1}] = \frac{1}{N} \sum_{n=0}^{N-1} [x(n) - \hat{\mu}_x]^2 \tag{3.6.2}$$
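The two estimators (3.6.1) and (3.6.2) translate directly into code. The following Python sketch assumes real-valued observations stored in a list; the function names and the data values are illustrative.

```python
def sample_mean(x):
    """Arithmetic-average estimator of the mean, (3.6.1)."""
    return sum(x) / len(x)

def sample_variance(x):
    """Variance estimator (3.6.2): divides by N, not N - 1."""
    mu_hat = sample_mean(x)
    return sum((v - mu_hat) ** 2 for v in x) / len(x)

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # assumed observations
print(sample_mean(x), sample_variance(x))
```

Each call returns one point estimate; a different set of observations would generally give different values.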

If we repeat this procedure a large number of times, we will obtain a large number of
estimates, which can be used to generate a histogram showing the distribution of the estimates.
Before the collection of observations, we would like to describe all sets of data that can be
obtained by using the random variables $\{x(n, \zeta)\}_0^{N-1}$. The obtained set of $N$ observations
$\{x(n)\}_0^{N-1}$ can thus be regarded as one realization of the random variables $\{x(n, \zeta)\}_0^{N-1}$
defined on an $N$-dimensional sample space. In this sense, the estimator $\hat{\theta}[\{x(n, \zeta)\}_0^{N-1}]$
becomes a random variable whose distribution can be obtained from the joint distribution
of the random variables $\{x(n, \zeta)\}_0^{N-1}$. This distribution is called the sampling distribution
of the estimator and is a fundamental concept in estimation theory because it provides all
the information we need to evaluate the quality of an estimator.

The sampling distribution of a “good” estimator should be concentrated as closely as 
possible about the parameter that it estimates. To determine how “good” an estimator is 
and how different estimators of the same parameter compare with one another, we need to 
determine their sampling distributions. Since it is not always possible to derive the exact 
sampling distributions, we have to resort to properties that use the lower-order moments 
(mean, variance, mean square error) of the estimator. 


Bias of estimator. The bias of an estimator $\hat{\theta}$ of a parameter $\theta$ is defined as

$$B(\hat{\theta}) \triangleq E[\hat{\theta}] - \theta \tag{3.6.3}$$

while the normalized bias is defined as

$$\varepsilon_b \triangleq \frac{B(\hat{\theta})}{\theta} \qquad \theta \ne 0 \tag{3.6.4}$$

When $B(\hat{\theta}) = 0$, the estimator is said to be unbiased and the pdf of the estimator is centered
exactly at the true value $\theta$. Generally, one should select estimators that are unbiased such
as the mean estimator in (3.6.1) or very nearly unbiased such as the variance estimator in
(3.6.2). However, it is not always wise to select an unbiased estimator, as we will see below
and in Section 5.2 on the estimation of autocorrelation sequences.


Variance of estimator. The variance of the estimator $\hat{\theta}$ is defined by

$$\mathrm{var}(\hat{\theta}) = \sigma_{\hat{\theta}}^2 \triangleq E\{|\hat{\theta} - E\{\hat{\theta}\}|^2\} \tag{3.6.5}$$

which measures the spread of the pdf of $\hat{\theta}$ around its average value. Therefore, one would
select an estimator with the smallest variance. However, this selection is not always compatible
with the small bias requirement. As we will see below, reducing variance may result
in an increase in bias. Therefore, a balance between these two conflicting requirements is
required, which is provided by the mean square error property. The normalized standard
deviation (also called the coefficient of variation) is defined by

$$\varepsilon_r \triangleq \frac{\sigma_{\hat{\theta}}}{\theta} \qquad \theta \ne 0 \tag{3.6.6}$$


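Mean square error. The mean square error (MSE) of the estimator is given by

$$ \mathrm{MSE}(\theta) = E\{|\hat{\theta} - \theta|^2\} = \sigma_{\hat{\theta}}^2 + |B(\hat{\theta})|^2 \tag{3.6.7} $$

Indeed, we have

$$ \mathrm{MSE}(\theta) = E\{|\hat{\theta} - E\{\hat{\theta}\} - (\theta - E\{\hat{\theta}\})|^2\} $$
$$ = E\{|\hat{\theta} - E\{\hat{\theta}\}|^2\} + E\{|\theta - E\{\hat{\theta}\}|^2\} - (\theta - E\{\hat{\theta}\})\,E\{(\hat{\theta} - E\{\hat{\theta}\})^*\} - (\theta - E\{\hat{\theta}\})^*\,E\{\hat{\theta} - E\{\hat{\theta}\}\} \tag{3.6.8} $$
$$ = |\theta - E\{\hat{\theta}\}|^2 + E\{|\hat{\theta} - E\{\hat{\theta}\}|^2\} \tag{3.6.9} $$

which leads to (3.6.7) by using (3.6.3) and (3.6.5). Ideally, we would like to minimize the
MSE, but this minimum is not always zero. Hence minimizing variance can increase the
bias. The normalized MSE is defined as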

$$ \epsilon \triangleq \frac{\mathrm{MSE}(\theta)}{\theta} \qquad \theta \neq 0 \tag{3.6.10} $$
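The trade-off between bias and variance can be made concrete with a small numerical sketch. The scaled sample mean used here, $c\hat{\mu}$ with a fixed constant c, and the values of theta, sigma2, and N are illustrative assumptions that do not appear in the text; they are chosen only because the bias, variance, and MSE of this estimator have simple closed forms for IID samples.

```python
import random
import statistics

def shrunk_mean_quality(c, theta, sigma2, N):
    """Closed-form bias, variance, and MSE of c * (sample mean of N IID samples)."""
    bias = (c - 1.0) * theta          # E{estimate} - theta, as in (3.6.3)
    variance = c * c * sigma2 / N     # spread about the estimator's own mean, (3.6.5)
    mse = variance + bias ** 2        # decomposition (3.6.7)
    return bias, variance, mse

theta, sigma2, N = 1.0, 4.0, 10       # hypothetical values chosen for illustration
table = {c: shrunk_mean_quality(c, theta, sigma2, N) for c in (1.0, 0.9, 0.7, 0.5)}
for c, (bias, variance, mse) in table.items():
    print(f"c={c:.1f}  bias={bias:+.3f}  var={variance:.3f}  MSE={mse:.3f}")

# Monte Carlo check of the c = 0.7 row with Gaussian samples.
rng = random.Random(0)
c, trials = 0.7, 20000
estimates = [c * statistics.fmean([rng.gauss(theta, sigma2 ** 0.5) for _ in range(N)])
             for _ in range(trials)]
mse_mc = statistics.fmean([(e - theta) ** 2 for e in estimates])
print(f"Monte Carlo MSE for c=0.7: {mse_mc:.3f}")
```

For these particular values the biased choice c = 0.7 has a smaller MSE than the unbiased choice c = 1, which is one way to see why an unbiased estimator is not always the wisest selection. The optimal c depends on the unknown parameter, so this is an illustration rather than a practical recipe.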


Cramér-Rao lower bound. If it is possible to minimize the MSE when the bias is zero, 
then clearly the variance is also minimized. Such estimators are called minimum variance 
unbiased estimators, and they attain an important minimum bound on the variance of the 
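estimator, called the Cramér-Rao lower bound (CRLB), or minimum variance bound. If $\hat{\theta}$
is unbiased, then it follows that $E\{\hat{\theta} - \theta\} = 0$, which may be expressed as

$$ \int\cdots\int (\hat{\theta} - \theta)\, f_{\mathbf{x};\theta}(\mathbf{x};\theta)\, d\mathbf{x} = 0 \tag{3.6.11} $$

where $\mathbf{x}(\zeta) = [x_1(\zeta), x_2(\zeta), \ldots, x_N(\zeta)]^T$ and $f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ is the joint density of $\mathbf{x}(\zeta)$, which
depends on a fixed but unknown parameter $\theta$. If we differentiate (3.6.11) with respect to $\theta$,
assuming real-valued $\hat{\theta}$, we obtain

$$ 0 = \int\cdots\int \frac{\partial}{\partial\theta}\left[(\hat{\theta} - \theta)\, f_{\mathbf{x};\theta}(\mathbf{x};\theta)\right] d\mathbf{x} = \int\cdots\int (\hat{\theta} - \theta)\, \frac{\partial f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta}\, d\mathbf{x} - 1 \tag{3.6.12} $$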


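Using the fact

$$ \frac{\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]}{\partial\theta} = \frac{1}{f_{\mathbf{x};\theta}(\mathbf{x};\theta)}\, \frac{\partial f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta} $$

or

$$ \frac{\partial f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta} = \frac{\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]}{\partial\theta}\, f_{\mathbf{x};\theta}(\mathbf{x};\theta) \tag{3.6.13} $$

and substituting (3.6.13) in (3.6.12), we get

$$ \int\cdots\int \left\{(\hat{\theta} - \theta)\, \frac{\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]}{\partial\theta}\right\} f_{\mathbf{x};\theta}(\mathbf{x};\theta)\, d\mathbf{x} = 1 \tag{3.6.14} $$

Clearly, the left side of (3.6.14) is simply the expectation of the expression inside the
brackets, that is,

$$ E\left\{(\hat{\theta} - \theta)\, \frac{\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]}{\partial\theta}\right\} = 1 \tag{3.6.15} $$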

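Using the Cauchy-Schwarz inequality (Papoulis 1991; Stark and Woods 1994)
$|E\{x(\zeta)y(\zeta)\}|^2 \le E\{|x(\zeta)|^2\}\, E\{|y(\zeta)|^2\}$, we obtain

$$ E\{(\hat{\theta} - \theta)^2\}\; E\left\{\left(\frac{\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]}{\partial\theta}\right)^2\right\} \ge \left[E\left\{(\hat{\theta} - \theta)\, \frac{\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]}{\partial\theta}\right\}\right]^2 = 1 \tag{3.6.16} $$

The first term on the left-hand side is the variance of the estimator $\hat{\theta}$ since it is unbiased.
Hence

$$ \mathrm{var}(\hat{\theta}) \ge \frac{1}{E\{[\partial \ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)/\partial\theta]^2\}} \tag{3.6.17} $$

which is one form of the CRLB and can also be expressed as

$$ \mathrm{var}(\hat{\theta}) \ge -\frac{1}{E\{\partial^2 \ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)/\partial\theta^2\}} \tag{3.6.18} $$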
The function $\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ is called the log likelihood function of $\theta$. The CRLB expresses
the minimum error variance of any unbiased estimator $\hat{\theta}$ of $\theta$ in terms of the joint density $f_{\mathbf{x};\theta}(\mathbf{x};\theta)$
of observations. Hence the variance of every unbiased estimator is bounded from below by a certain
number. An unbiased estimate that satisfies the CRLB (3.6.18) with equality is called an
efficient estimate. If such an estimate exists, then it can be obtained as a unique solution to
the likelihood equation

$$ \frac{\partial \ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta} = 0 \tag{3.6.19} $$

The solution of (3.6.19) is called the maximum likelihood (ML) estimate. Note that if the
efficient estimate does not exist, then the ML estimate will not achieve the lower bound
and hence it is difficult to ascertain how closely the variance of any estimate will approach
the bound. The CRLB can be generalized to handle the estimation of vector parameters
(Therrien 1992).
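As a hedged illustration of the bound, consider N IID Gaussian observations with unknown mean $\theta$ and known standard deviation $\sigma$ (a standard textbook case that we assume here; it is not worked in this section). The derivative of the log likelihood is $\sum_n (x_n - \theta)/\sigma^2$, so (3.6.17) gives the bound $\sigma^2/N$, and the root of the likelihood equation (3.6.19) is the sample mean. The sketch below estimates both sides by simulation; the function name `score` and all numerical values are our own choices.

```python
import random
import statistics

def score(x, theta, sigma):
    """Derivative of the Gaussian log likelihood with respect to theta (sigma known)."""
    return sum(xn - theta for xn in x) / sigma ** 2

rng = random.Random(1)
theta, sigma, N, trials = 2.0, 3.0, 25, 20000   # hypothetical values
scores, ml_estimates = [], []
for _ in range(trials):
    x = [rng.gauss(theta, sigma) for _ in range(N)]
    scores.append(score(x, theta, sigma))
    ml_estimates.append(statistics.fmean(x))     # solves the likelihood equation

# Denominator of (3.6.17), estimated by averaging the squared score.
fisher_info = statistics.fmean([s * s for s in scores])
crlb = 1.0 / fisher_info
var_ml = statistics.pvariance(ml_estimates)
print(f"CRLB estimate {crlb:.4f}, var of ML estimate {var_ml:.4f}, "
      f"sigma^2/N = {sigma ** 2 / N:.4f}")
```

In this special case the two numbers agree up to simulation error, which is consistent with the sample mean being an efficient estimate of a Gaussian mean.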


Consistency of estimator. If the MSE of the estimator can be made to approach zero
as the sample size N becomes large, then from (3.6.7) both the bias and the variance will
tend to zero. Then the sampling distribution will tend to concentrate about $\theta$, and eventually
as $N \to \infty$, the sampling distribution will become an impulse at $\theta$. This is an important
and desirable property, and the estimator that possesses it is called a consistent estimator.


Confidence interval. If we know the sampling distribution of an estimator, we can 
use the observations to compute an interval that has a specified probability of covering 
the unknown true parameter value. This interval is called a confidence interval, and the 
coverage probability is called the confidence level. When we interpret the meaning of 
confidence intervals, it is important to remember that it is the interval that is the random 
variable, and not the parameter. This concept will be explained in the sequel by means of 
specific examples. 


3.6.2 Estimation of Mean 


The natural estimator of the mean $\mu_x$ of a stationary sequence x(n) from the observations
$\{x(n)\}_0^{N-1}$ is the sample mean, given by

$$ \hat{\mu}_x = \frac{1}{N} \sum_{n=0}^{N-1} x(n) \tag{3.6.20} $$

The estimate $\hat{\mu}_x$ is a random variable that depends on the number and values of the
observations. Changing N or the set of observations will lead to another value for $\hat{\mu}_x$. Since the
mean of the estimator is given by

$$ E\{\hat{\mu}_x\} = \mu_x \tag{3.6.21} $$

the estimator $\hat{\mu}_x$ is unbiased. If $x(n) \sim \mathrm{WN}(\mu_x, \sigma_x^2)$, we have

$$ \mathrm{var}(\hat{\mu}_x) = \frac{\sigma_x^2}{N} \tag{3.6.22} $$

because the samples of the process are uncorrelated random variables. This variance, which
is a measure of the estimator's quality, increases if x(n) is nonwhite.

Indeed, for a correlated random sequence, the variance of $\hat{\mu}_x$ is given by (see Problem 3.30)

$$ \mathrm{var}(\hat{\mu}_x) = N^{-1} \sum_{l=-N}^{N} \left(1 - \frac{|l|}{N}\right) \gamma_x(l) \le N^{-1} \sum_{l=-N}^{N} |\gamma_x(l)| \tag{3.6.23} $$

where $\gamma_x(l)$ is the covariance sequence of x(n). If $\gamma_x(l) \to 0$ as $l \to \infty$, then $\mathrm{var}(\hat{\mu}_x) \to 0$
as $N \to \infty$ and hence $\hat{\mu}_x$ is a consistent estimator of $\mu_x$. If $\sum_{l=-\infty}^{\infty} |\gamma_x(l)| < \infty$, then
from (3.6.23)

$$ \lim_{N\to\infty} N\, \mathrm{var}(\hat{\mu}_x) = \lim_{N\to\infty} \sum_{l=-N}^{N} \left(1 - \frac{|l|}{N}\right) \gamma_x(l) = \sum_{l=-\infty}^{\infty} \gamma_x(l) \tag{3.6.24} $$

The expression for $\mathrm{var}(\hat{\mu}_x)$ in (3.6.23) can also be put in the form (see Problem 3.30)

$$ \mathrm{var}(\hat{\mu}_x) = \frac{\sigma_x^2}{N}\, [1 + \Delta_N(\rho_x)] \tag{3.6.25} $$

where

$$ \Delta_N(\rho_x) = 2 \sum_{l=1}^{N} \left(1 - \frac{l}{N}\right) \rho_x(l) \qquad \rho_x(l) = \frac{\gamma_x(l)}{\sigma_x^2} \tag{3.6.26} $$

When $\Delta_N(\rho_x) \ge 0$, the variance of the estimator increases as the amount of correlation
among the samples of x(n) increases. This implies that as the correlation increases, we need
more samples to retain the quality of the estimate because each additional sample carries
“less information.” For this reason the estimation of long-memory processes and processes
with infinite variance is extremely difficult.
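The effect of correlation on (3.6.25) and (3.6.26) is easy to evaluate numerically. The following sketch computes $\Delta_N(\rho_x)$ for a user-supplied normalized autocovariance; the function names, the choice $\rho_x(l) = 0.9^{|l|}$, and the "equivalent number of uncorrelated samples" $N/[1 + \Delta_N(\rho_x)]$ are our own illustrative assumptions, the last being one informal reading of the phrase "less information."

```python
def delta_N(rho, N):
    """Delta_N(rho_x) of (3.6.26); rho(l) is the normalized autocovariance at lag l."""
    return 2.0 * sum((1.0 - l / N) * rho(l) for l in range(1, N + 1))

def var_sample_mean(sigma2, rho, N):
    """Variance of the sample mean according to (3.6.25)."""
    return sigma2 / N * (1.0 + delta_N(rho, N))

N = 100
rho_white = lambda l: 0.0               # uncorrelated samples
rho_corr = lambda l: 0.9 ** abs(l)      # hypothetical strongly correlated sequence

var_white = var_sample_mean(1.0, rho_white, N)
var_corr = var_sample_mean(1.0, rho_corr, N)
equivalent_N = N / (1.0 + delta_N(rho_corr, N))

print(f"var (white)      = {var_white:.4f}")
print(f"var (correlated) = {var_corr:.4f}")
print(f"equivalent number of uncorrelated samples = {equivalent_N:.1f}")
```

With unit variance and N = 100, the correlated sequence gives a variance roughly 17 times larger than the white one, so in this sense the 100 correlated samples are worth only about 6 uncorrelated ones.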


Sampling distribution. If we know the joint pdf of the random variables $\{x(n)\}_0^{N-1}$,
we can determine, at least in principle, the pdf of $\hat{\mu}_x$. For example, if it is assumed that the
observations are IID as $\mathcal{N}(\mu_x, \sigma_x^2)$, then from (3.6.21) and (3.6.22), it can be seen that $\hat{\mu}_x$
is normal with mean $\mu_x$ and variance $\sigma_x^2/N$, that is,

$$ f_{\hat{\mu}_x}(\hat{\mu}_x) = \frac{1}{\sqrt{2\pi}\,(\sigma_x/\sqrt{N})} \exp\left[-\frac{1}{2}\left(\frac{\hat{\mu}_x - \mu_x}{\sigma_x/\sqrt{N}}\right)^2\right] \tag{3.6.27} $$

which is the sampling distribution of the mean. If N is large, then from the central limit
theorem, the sampling distribution of the sample mean (3.6.27) is usually very close to the
normal distribution, even if the individual distributions are not normal.
If we know the standard deviation $\sigma_x$, we can compute the probability

$$ \Pr\left\{\mu_x - k\frac{\sigma_x}{\sqrt{N}} < \hat{\mu}_x < \mu_x + k\frac{\sigma_x}{\sqrt{N}}\right\} \tag{3.6.28} $$

that the random variable $\hat{\mu}_x$ is within a certain interval specified by two fixed quantities. A
simple rearrangement of the above inequality leads to

$$ \Pr\left\{\hat{\mu}_x - k\frac{\sigma_x}{\sqrt{N}} < \mu_x < \hat{\mu}_x + k\frac{\sigma_x}{\sqrt{N}}\right\} \tag{3.6.29} $$

which gives the probability that the fixed quantity $\mu_x$ lies between the two random variables
$\hat{\mu}_x - k\sigma_x/\sqrt{N}$ and $\hat{\mu}_x + k\sigma_x/\sqrt{N}$. Hence (3.6.29) provides the probability that an interval
with fixed length $2k\sigma_x/\sqrt{N}$ and randomly centered at the estimated mean includes the
true mean. If we choose k so that the probability defined by (3.6.29) is equal to 0.95, the
interval is known as the 95 percent confidence interval. To understand the meaning of this
reasoning, we stress that for each set of measurements we compute a confidence interval
that either contains or does not contain the true mean. However, if we repeat this process for
a large number of observation sets, about 95 percent of the obtained confidence intervals
will include the true mean. We stress that by no means does this imply that a confidence
interval includes the true mean with probability 0.95.
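The repeated-experiment interpretation of (3.6.29) can be checked with a short simulation. The sketch below assumes IID Gaussian observations with known standard deviation; the values of mu, sigma, N, and the number of trials are arbitrary choices made for illustration.

```python
import math
import random
import statistics

rng = random.Random(2)
mu, sigma, N, trials = 5.0, 2.0, 30, 10000       # hypothetical values
k = statistics.NormalDist().inv_cdf(0.975)       # close to 1.96 for 95 percent
half_width = k * sigma / math.sqrt(N)            # fixed length, random center

covered = 0
for _ in range(trials):
    mu_hat = statistics.fmean([rng.gauss(mu, sigma) for _ in range(N)])
    # Each interval either contains the fixed true mean or it does not.
    if mu_hat - half_width < mu < mu_hat + half_width:
        covered += 1

coverage = covered / trials
print(f"k = {k:.3f}, fraction of intervals that contain mu = {coverage:.3f}")
```

The printed fraction should be close to 0.95. It describes the long-run behavior of the procedure over many observation sets, not the status of any single computed interval.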

If the variance $\sigma_x^2$ is unknown, then it has to be determined from the observations. This
results in two modifications of (3.6.29). First, $\sigma_x$ is replaced by $\hat{\sigma}_x$, where

$$ \hat{\sigma}_x^2 = \frac{1}{N-1} \sum_{n=0}^{N-1} [x(n) - \hat{\mu}_x]^2 \tag{3.6.30} $$

which implies that the center and the length of the confidence interval are different for
each set of observations. Second, the random variable $(\hat{\mu}_x - \mu_x)/(\hat{\sigma}_x/\sqrt{N})$ is distributed
according to Student's t distribution with $\nu = N - 1$ degrees of freedom (Parzen 1960),
which tends to a Gaussian for large values of N. In these cases, the factor k in (3.6.29)
is replaced by the appropriate value t of Student's distribution, using N - 1 degrees of
freedom, for the desired level of confidence.

If the observations are normal but not IID, then from (3.6.25), the mean estimator $\hat{\mu}_x$
is normal with mean $\mu_x$ and variance $(\sigma_x^2/N)[1 + \Delta_N(\rho_x)]$. It is now easy to construct
exact confidence intervals for $\mu_x$ if $\rho_x(l)$ is known, and approximate confidence intervals
if $\rho_x(l)$ is to be estimated from the observations. For large N, the variance $\mathrm{var}(\hat{\mu}_x)$ can be
approximated by

$$ \mathrm{var}(\hat{\mu}_x) = \frac{\sigma_x^2}{N}\, [1 + \Delta_N(\rho_x)] \simeq \frac{v}{N} \qquad v \triangleq \sigma_x^2 \left[1 + 2 \sum_{l=1}^{\infty} \rho_x(l)\right] \tag{3.6.31} $$

and hence an approximate 95 percent confidence interval for $\mu_x$ is given by

$$ \left(\hat{\mu}_x - 1.96\sqrt{v/N},\; \hat{\mu}_x + 1.96\sqrt{v/N}\right) \tag{3.6.32} $$

This means that, on average, the above interval will enclose the true value $\mu_x$ on 95 percent
of occasions. For many practical random processes (especially those modeled as ARMA
processes), the result in (3.6.32) is a good approximation.

EXAMPLE 3.6.1. Consider the AR(1) process

$$ x(n) = a\,x(n-1) + w(n) \qquad -1 < a < 1 $$

where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$. We wish to compute the variance of the mean estimator $\hat{\mu}_x$ of the
process x(n). Using straightforward calculations, we obtain

$$ \mu_x = 0 \qquad \sigma_x^2 = \frac{\sigma_w^2}{1 - a^2} \qquad \text{and} \qquad \rho_x(l) = a^{|l|} $$

From (3.6.26) we evaluate the term

$$ \Delta_N(\rho_x) = \frac{2a}{1-a}\left[1 - \frac{1}{N(1-a)} + \frac{a^N}{N(1-a)}\right] \simeq \frac{2a}{1-a} \qquad \text{for } N \gg 1 $$

When $a \to 1$, that is, when the dependence between the signal samples increases, then the factor
$\Delta_N(\rho_x)$ takes large values and the quality of estimator decreases drastically. Similar conclusions
can be drawn using the approximation (3.6.31)

$$ \mathrm{var}(\hat{\mu}_x) \simeq \frac{\sigma_x^2}{N}\, \frac{1+a}{1-a} = \frac{1}{N}\, \frac{\sigma_w^2}{(1-a)^2} $$


We will next verify these results using two Monte Carlo simulations: one for a = 0.9, which
represents high correlations among samples, and the other for a = 0.1. Using a Gaussian
pseudorandom number generator with mean 0 and variance $\sigma_w^2 = 1$, we generated N = 100
samples of the AR(1) process x(n). Using v in (3.6.31) and (3.6.32), we next computed the
confidence intervals. For a = 0.9, we obtain

v = 100 and confidence interval: $(\hat{\mu}_x - 1.96,\; \hat{\mu}_x + 1.96)$

and for a = 0.1, we obtain

v = 1.2345 and confidence interval: $(\hat{\mu}_x - 0.2178,\; \hat{\mu}_x + 0.2178)$

Clearly, when the dependence between signal samples increases, the quality of the estimator
decreases drastically and hence the confidence interval is wider. To have the same confidence
interval, we should increase the number of samples N.

We next estimate the mean, using (3.6.20), and we repeat the experiment 10,000 times. 
Figure 3.14 shows histograms of the computed means for a = 0.9 and a = 0.1. The confidence 
intervals are also shown as dotted lines around the true mean. The histograms are approximately 
Gaussian in shape. The histogram for the high-correlation case is wider than that for the low- 
correlation case, which is to be expected. The 95 percent confidence intervals also indicate that 
very few estimates are outside the interval. 
FIGURE 3.14
Histograms of mean estimates in Example 3.6.1.
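A reduced version of the experiment in Example 3.6.1 can be sketched as follows. This is not the code used to produce the figure; the function names, the burn-in length used to remove the start-up transient, the random seed, and the smaller number of repetitions (2,000 instead of 10,000) are our own assumptions. The quantity v is computed from the closed form $\sigma_w^2/(1-a)^2$ that follows from (3.6.31) for the AR(1) process.

```python
import math
import random
import statistics

def ar1_v(a, sigma_w2=1.0):
    """v of (3.6.31) for an AR(1) process: sigma_w^2 / (1 - a)^2."""
    return sigma_w2 / (1.0 - a) ** 2

def ar1_sample_mean(a, N, rng, burn_in=100):
    """Sample mean (3.6.20) of N samples of x(n) = a x(n-1) + w(n), w(n) ~ N(0, 1)."""
    x = 0.0
    for _ in range(burn_in):               # discard the start-up transient
        x = a * x + rng.gauss(0.0, 1.0)
    total = 0.0
    for _ in range(N):
        x = a * x + rng.gauss(0.0, 1.0)
        total += x
    return total / N

N, trials = 100, 2000
rng = random.Random(3)
results = {}
for a in (0.9, 0.1):
    v = ar1_v(a)
    half_width = 1.96 * math.sqrt(v / N)   # half-length of the interval (3.6.32)
    means = [ar1_sample_mean(a, N, rng) for _ in range(trials)]
    # The true mean is 0, so an estimate is inside the interval when |estimate| < half_width.
    inside = sum(abs(m) < half_width for m in means) / trials
    results[a] = (v, half_width, statistics.pvariance(means), inside)
    print(f"a={a}: v={v:.4f}, half-width={half_width:.4f}, "
          f"var of estimates={results[a][2]:.4f}, fraction inside={inside:.3f}")
```

The computed values of v and of the interval half-lengths can be compared with those quoted in the example, and the spread of the simulated means is much larger for a = 0.9 than for a = 0.1.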


3.6.3 Estimation of Variance 


The natural estimator of the variance $\sigma_x^2$ of a stationary sequence x(n) from the observations
$\{x(n)\}_0^{N-1}$ is the sample variance, given by

$$ \hat{\sigma}_x^2 \triangleq \frac{1}{N} \sum_{n=0}^{N-1} [x(n) - \hat{\mu}_x]^2 \tag{3.6.33} $$

By using the mean estimate $\hat{\mu}_x$ from (3.6.20), the mean of the variance estimator can be
shown to equal (see Problem 3.31)

$$ E\{\hat{\sigma}_x^2\} = \sigma_x^2 - \mathrm{var}(\hat{\mu}_x) = \sigma_x^2 - \frac{1}{N} \sum_{l=-N}^{N} \left(1 - \frac{|l|}{N}\right) \gamma_x(l) \tag{3.6.34} $$

If the sequence x(n) is uncorrelated, then

$$ E\{\hat{\sigma}_x^2\} = \sigma_x^2 - \frac{\sigma_x^2}{N} = \left(\frac{N-1}{N}\right) \sigma_x^2 \tag{3.6.35} $$

From (3.6.34) or (3.6.35), it is obvious that the estimator in (3.6.33) is biased. If $\gamma_x(l) \to 0$
as $l \to \infty$, then $\mathrm{var}(\hat{\mu}_x) \to 0$ as $N \to \infty$ and hence $\hat{\sigma}_x^2$ is an asymptotically unbiased
estimator of $\sigma_x^2$. In practical applications, the variance estimate is nearly unbiased for large
N. Note that if we use the actual mean $\mu_x$ in (3.6.33), then the resulting estimator is unbiased.
The general expression for the variance of the variance estimator is fairly complicated
and requires higher-order moments. It can be shown that for either estimator

$$ \mathrm{var}(\hat{\sigma}_x^2) \simeq \frac{\gamma_x^{(4)} - \sigma_x^4}{N} \qquad \text{for large } N \tag{3.6.36} $$

where $\gamma_x^{(4)}$ is the fourth central moment of x(n) (Brockwell and Davis 1991). Thus the
estimator in (3.6.33) is also consistent.
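The bias predicted by (3.6.35) for an uncorrelated sequence can be observed with a short simulation. The sketch below assumes Gaussian white noise; the mean, variance, sample size, and number of trials are arbitrary illustrative values, and a small N is used on purpose so that the factor (N - 1)/N is clearly visible.

```python
import random
import statistics

def sample_variance(x):
    """Variance estimator of (3.6.33): divisor N, true mean replaced by the sample mean."""
    mu_hat = statistics.fmean(x)
    return sum((xn - mu_hat) ** 2 for xn in x) / len(x)

rng = random.Random(4)
sigma2, N, trials = 4.0, 10, 20000       # hypothetical values
estimates = [sample_variance([rng.gauss(1.0, sigma2 ** 0.5) for _ in range(N)])
             for _ in range(trials)]

mean_of_estimates = statistics.fmean(estimates)
predicted = (N - 1) / N * sigma2         # right-hand side of (3.6.35)
print(f"average of estimates = {mean_of_estimates:.3f}, "
      f"(N-1)/N * sigma^2 = {predicted:.3f}, true variance = {sigma2:.3f}")
```

The average of the estimates settles near $(N-1)\sigma_x^2/N$ rather than near $\sigma_x^2$, and the gap shrinks as N grows, in line with the asymptotic unbiasedness noted above.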


Sampling distribution. In the case of the mean estimator, the sampling distribution
involved the distribution of sums of random variables. The variance estimator involves the
sum of the squares of random variables, for which the sampling distribution computation
is complicated. For example, if there are N independent measurements from an $\mathcal{N}(0, 1)$
distribution, then the sampling distribution of the random variable

$$ \chi_N^2 \triangleq x_1^2 + x_2^2 + \cdots + x_N^2 \tag{3.6.37} $$

is given by the chi-squared distribution with N degrees of freedom. The general form of $\chi_\nu^2$
with $\nu$ degrees of freedom is

$$ f_{\chi_\nu^2}(x) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\nu/2 - 1} \exp\left(-\frac{x}{2}\right) \qquad 0 \le x \le \infty \tag{3.6.38} $$

where $\Gamma(\nu/2) = \int_0^\infty e^{-t}\, t^{\nu/2 - 1}\, dt$ is the gamma function with argument $\nu/2$.


For the variance estimator in (3.6.33), it can be shown (Parzen 1960) that $N\hat{\sigma}_x^2/\sigma_x^2$ is
distributed as chi squared with $\nu = N - 1$ degrees of freedom. This means that, for any set
of N observations, there will only be N - 1 independent deviations $\{x(n) - \hat{\mu}_x\}$, since their
sum is zero from the definition of the mean. Assuming that the observations are $\mathcal{N}(\mu, \sigma^2)$,
the random variables $x(n)/\sigma$ will be $\mathcal{N}(\mu/\sigma, 1)$ and hence the random variable

$$ \frac{N\hat{\sigma}_x^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{n=0}^{N-1} [x(n) - \hat{\mu}_x]^2 \tag{3.6.39} $$

will be chi squared distributed with $\nu = N - 1$. Therefore, using values of the chi-squared
distribution, confidence intervals for the variance estimator can be computed. In particular,
since $N\hat{\sigma}_x^2/\sigma^2$ is distributed as $\chi_\nu^2$, the 95 percent limits of the form

$$ \Pr\left\{\chi_\nu\!\left(\frac{0.05}{2}\right) < \frac{N\hat{\sigma}_x^2}{\sigma^2} < \chi_\nu\!\left(1 - \frac{0.05}{2}\right)\right\} = 0.95 \tag{3.6.40} $$

can be obtained from chi-squared tables (Fisher and Yates 1938). By rearranging (3.6.40),
the random variable $\sigma^2/\hat{\sigma}_x^2$ satisfies

$$ \Pr\left\{\frac{N}{\chi_\nu(0.975)} < \frac{\sigma^2}{\hat{\sigma}_x^2} < \frac{N}{\chi_\nu(0.025)}\right\} = 0.95 \tag{3.6.41} $$

Using $l_1 = N/\chi_\nu(0.975)$ and $l_2 = N/\chi_\nu(0.025)$, we see that (3.6.41) implies that

$$ \Pr\{l_2\hat{\sigma}_x^2 > \sigma^2 \text{ and } l_1\hat{\sigma}_x^2 < \sigma^2\} = 0.95 \tag{3.6.42} $$

Thus the 95 percent confidence interval based on the estimate $\hat{\sigma}_x^2$ is $(l_1\hat{\sigma}_x^2,\; l_2\hat{\sigma}_x^2)$. Note
that this interval is sensitive to the validity of the normal assumption of random variables
leading to (3.6.39). This is not the case for the confidence intervals for the mean estimates
because, thanks to the central limit theorem, the computation of the interval can be based
on the normal assumption.
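When chi-squared tables are not at hand, the factors $l_1$ and $l_2$ can be approximated numerically. The sketch below uses the Wilson-Hilferty approximation to the chi-squared quantiles, which is our substitution for the tables cited in the text and is only approximate; the function names are hypothetical.

```python
import statistics

def chi2_quantile(p, nu):
    """Wilson-Hilferty approximation to the p-quantile of chi-squared with nu dof."""
    z = statistics.NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * nu)
    return nu * (1.0 - c + z * c ** 0.5) ** 3

def variance_ci_factors(N, level=0.95):
    """Factors (l1, l2) so that (l1 * s2, l2 * s2) is the interval implied by (3.6.42)."""
    alpha = 1.0 - level
    nu = N - 1                                   # degrees of freedom
    l1 = N / chi2_quantile(1.0 - alpha / 2.0, nu)
    l2 = N / chi2_quantile(alpha / 2.0, nu)
    return l1, l2

l1, l2 = variance_ci_factors(100)
print(f"N = 100: l1 = {l1:.3f}, l2 = {l2:.3f}")
```

Because the interval is built from ratios, it is not symmetric about the estimate: the upper factor lies further from 1 than the lower factor.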


EXAMPLE 3.6.2. Consider again the AR(1) process given in Example 3.6.1:

$$ x(n) = a\,x(n-1) + w(n) \qquad -1 < a < 1 \qquad w(n) \sim \mathrm{WN}(0, 1) $$

with

$$ \mu_x = 0 \qquad \sigma_x^2 = \frac{\sigma_w^2}{1 - a^2} \qquad \text{and} \qquad \rho_x(l) = a^{|l|} \tag{3.6.43} $$

We wish to compute the mean of the variance estimator $\hat{\sigma}_x^2$ of the process x(n). From (3.6.34),
we obtain

$$ E\{\hat{\sigma}_x^2\} = \sigma_x^2 \left[1 - \frac{1}{N} \sum_{l=-N}^{N} \left(1 - \frac{|l|}{N}\right) a^{|l|}\right] \tag{3.6.44} $$

When $a \to 1$, that is, when the dependence between the signal samples increases, the mean
of the estimate deviates significantly from the true value $\sigma_x^2$ and the quality of the estimator
decreases drastically. For small dependence, the mean is very close to $\sigma_x^2$. These conclusions
can be verified using two Monte Carlo simulations as before: one for a = 0.9, which represents
high correlations among samples, and the other for a = 0.1. Using a Gaussian pseudorandom
number generator with mean 0 and unit variance, we generated N = 100 samples of the AR(1)
process x(n). The computed parameters according to (3.6.43) and (3.6.44) are

a = 0.9:   $\sigma_x^2 = 5.2632$   $E\{\hat{\sigma}_x^2\} = 4.3579$

a = 0.1:   $\sigma_x^2 = 1.0101$   $E\{\hat{\sigma}_x^2\} = 0.9978$


We next estimate the variance by using (3.6.33) and repeat the experiment 10,000 times.
Figure 3.15 shows histograms of computed variances for a = 0.9 and for a = 0.1. The computed
means of the variance estimates are also shown as dotted lines. Clearly, the histogram is much
wider for the high-correlation case and much narrower (almost symmetric and Gaussian) for the
low-correlation case.

FIGURE 3.15
Histograms of variance estimates in Example 3.6.2.

The 95 percent confidence intervals are given by $(l_1\hat{\sigma}_x^2,\; l_2\hat{\sigma}_x^2)$, where $l_1 = N/\chi_\nu(0.975)$
and $l_2 = N/\chi_\nu(0.025)$. The values of $l_1$ and $l_2$ are obtained from the chi-squared distribution
curves (Jenkins and Watts 1968). For N = 100, $l_1 = 0.77$ and $l_2 = 1.35$; hence the 95 percent
confidence intervals for $\sigma_x^2$ are

$$ (0.77\hat{\sigma}_x^2,\; 1.35\hat{\sigma}_x^2) $$

also shown as dashed lines around the mean value $E\{\hat{\sigma}_x^2\}$. The confidence interval for the
high-correlation case, a = 0.9, does not appear to be a good interval, which implies that the
approximation leading to (3.6.42) is not a good one for this case. Such is not the case for a = 0.1.
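The tabulated means in Example 3.6.2 follow directly from (3.6.43) and (3.6.44) and can be reproduced without any simulation. The sketch below evaluates both formulas for N = 100 and unit-variance driving noise; the function name is our own.

```python
def expected_sample_variance_ar1(a, N, sigma_w2=1.0):
    """Return (sigma_x^2, E{sample variance}) for an AR(1) process with coefficient a."""
    sigma_x2 = sigma_w2 / (1.0 - a * a)                       # from (3.6.43)
    weighted = sum((1.0 - abs(l) / N) * a ** abs(l) for l in range(-N, N + 1))
    return sigma_x2, sigma_x2 * (1.0 - weighted / N)          # from (3.6.44)

N = 100
for a in (0.9, 0.1):
    sigma_x2, mean_est = expected_sample_variance_ar1(a, N)
    relative_bias = (mean_est - sigma_x2) / sigma_x2
    print(f"a={a}: sigma_x^2={sigma_x2:.4f}, E[est]={mean_est:.4f}, "
          f"relative bias={relative_bias:+.2%}")
```

The relative bias is about -17 percent for a = 0.9 but only about -1 percent for a = 0.1, which quantifies how strongly sample correlation degrades the variance estimator at this sample size.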


3.7 SUMMARY 


In this chapter we provided an overview of the basic theory of discrete-time stochastic 
processes. We began with the notion of a random variable as a mapping from the abstract 
probability space to the real space, extended it to random vectors as a collection of random 
variables, and introduced discrete-time stochastic processes as an indexed family (or time 
series) of random variables. A complete probabilistic description of these random objects 
requires the knowledge of joint distribution or density functions, which is difficult to acquire 
except in simple cases. Therefore, the emphasis was placed on description using joint 
moments of distributions, and, in particular, the emphasis was placed on the second-order 
moments, which are relatively easy to estimate or compute in practice. 

We defined the mean and the variance to describe random variables, and we provided 
three useful models of random variables. For random vector description, we defined the 
mean vector and the autocorrelation matrix. Linear transformations of random vectors were 
discussed, using densities and correlation matrices. The normal random vector was then in- 
troduced as a useful model of a random vector. A particularly simple linear transformation, 
namely, the sum of independent random variables, was used to introduce random variables 
with stable and infinitely divisible distributions. To describe stochastic processes, we pro- 
ceeded to define mean and autocorrelation sequences. In many applications, the concept of 
stationarity of random processes is a useful one that reduces the computational complexity.
Assuming time invariance on the first two moments, we defined a wide-sense stationary 
(WSS) process in which the mean is a constant and correlation between random variables 
at two distinct times is a function of time difference or lag. The rest of the chapter was 
devoted to the analysis of WSS processes. 

A stochastic process is generally observed in practice as a single sample function (a 
speech signal or a radar signal) from which it is necessary to estimate the first- and the 
second-order moments. This requires the notion of ergodicity, which provides a framework 
for the computation of statistical averages using time averages over a single realization. 
Although this framework requires theoretical results using mean square convergence, we 
provided a simple approach of using appropriate time averages. An important random signal 
characteristic called variability was introduced. The WSS processes were then described 
in the frequency domain using the power spectral density function, which is a physical 
quantity that can be measured in practice. Some random processes exhibiting flat spectral 
envelopes were analyzed including one of white noise. Since random processes are generally 
processed using linear systems, we described linear system operations with random inputs 
in both the time and frequency domains. 

The properties of correlation matrices and sequences play an important role in filtering 
and estimation theory and were discussed in detail, including eigenanalysis. Another
important random signal characteristic called memory was also introduced. Stationary random
signals were modeled using autocorrelation matrices, and the relationship between spectral
flatness and eigenvalue spread was explored. These properties were used in an alternate rep- 
resentation of random vectors as well as processes using uncorrelated components which 
were based on diagonalization and triangularization of correlation matrices. This resulted 
in the discrete KL transform and KL expansion. These concepts will also be useful in later 
chapters on optimal filtering and adaptive filtering. 

Finally, we concluded this chapter with the introduction of elementary estimation the- 
ory. After discussion of properties of estimators, two important estimators of mean and 
variance were treated in detail along with their sampling distributions. These topics will be 
useful in many subsequent chapters. 


PROBLEMS 
3.1 The exponential density function is given by

$$ f_x(x) = \frac{1}{a}\, e^{-x/a}\, u(x) \tag{P.1} $$

where a is a parameter and u(x) is a unit step function.

(a) Plot the density function for a = 1.

(b) Determine the mean, variance, skewness, and kurtosis of the exponential random variable with
a = 1. Comment on the significance of these moments in terms of the shape of the density
function.

(c) Determine the characteristic function of the exponential pdf.


3.2 The Rayleigh density function is given by

$$ f_x(x) = \frac{x}{\sigma^2}\, e^{-x^2/(2\sigma^2)}\, u(x) \tag{P.2} $$

where $\sigma$ is a parameter and u(x) is a unit step function. Repeat Problem 3.1 for $\sigma = 1$.


3.3 Using the binomial expansion of $\{x(\zeta) - \mu_x\}^m$, show that the mth central moment is given by


m 


m 
Me = (T)cptutee, 


k=0 
a“ (m 
Similarly, show that AS = ye G ) uk mM” k 
k=0 


3.4 Consider a zero-mean random variable $x(\zeta)$. Using (3.1.26), show that the first four cumulants
of $x(\zeta)$ are given by (3.1.28) through (3.1.31).


3.5 Arandom vector x(¢) = [x1 (¢) xa(cy]P has mean vector fy = [1 qr and covariance matrix 


This vector is transformed to another random vector y(¢) by the following linear transformation: 


yi) i, 3 
HOV sk 2 ce 
y3() 2 3 


Determine (a) the mean vector $\boldsymbol{\mu}_y$, (b) the autocovariance matrix $\boldsymbol{\Gamma}_y$, and (c) the cross-correlation
matrix $\mathbf{R}_{xy}$.
3.6 Using the moment generating function, show that the linear transformation of a Gaussian random
vector is also Gaussian.


3.7 Let $\{x_k(\zeta)\}_{k=1}^{4}$ be four IID random variables with exponential distribution (P.1) with a = 1.
Let

$$ y_k(\zeta) = \sum_{l=1}^{k} x_l(\zeta) \qquad 1 \le k \le 4 $$

(a) Determine and plot the pdf of $y_2(\zeta)$.
(b) Determine and plot the pdf of $y_3(\zeta)$.
(c) Determine and plot the pdf of $y_4(\zeta)$.
(d) Compare the pdf of $y_4(\zeta)$ with that of the Gaussian density.


3.8 For each of the following, determine whether the random process is (1) WSS or (2) m.s. ergodic
in the mean.

(a) X(t) = A, where A is a random variable uniformly distributed between 0 and 1.
(b) $X_n = A \cos \omega_0 n$, where A is a Gaussian random variable with mean 0 and variance 1.
(c) A Bernoulli process with $\Pr[X_n = 1] = p$ and $\Pr[X_n = -1] = 1 - p$.


3.9 Consider the harmonic process x(n) defined in (3.3.50).

(a) Determine the mean of x(n).
(b) Show that the autocorrelation sequence is given by

$$ r_x(l) = \frac{1}{2} \sum_{k=1}^{M} A_k^2 \cos \omega_k l \qquad -\infty < l < \infty $$


3.10 Suppose that the random variables $\phi_k$ in the real-valued harmonic process model are distributed
with a pdf $f_{\phi_k}(\phi_k) = (1 + \cos \phi_k)/(2\pi)$, $-\pi \le \phi_k \le \pi$. Is the resulting stochastic process
stationary?


3.11 A stationary random sequence x(n) with mean $\mu_x = 4$ and autocovariance

$$ \gamma_x(n) = \begin{cases} 4 - |n| & |n| \le 3 \\ 0 & \text{otherwise} \end{cases} $$

is applied as an input to a linear shift-invariant (LSI) system whose impulse response h(n) is

$$ h(n) = u(n) - u(n-4) $$

where u(n) is a unit step sequence. The output of this system is another random sequence y(n).
Determine (a) the mean sequence $\mu_y(n)$, (b) the cross-covariance $\gamma_{xy}(n_1, n_2)$ between $x(n_1)$
and $y(n_2)$, and (c) the autocovariance $\gamma_y(n_1, n_2)$ of the output process y(n).


3.12 A causal LTI system, which is described by the difference equation

$$ y(n) = \tfrac{1}{2}\, y(n-1) + x(n) + \tfrac{1}{3}\, x(n-1) $$

is driven by a zero-mean WSS process with autocorrelation $r_x(l) = 0.5^{|l|}$.

(a) Determine the PSD and the autocorrelation of the output sequence y(n).
(b) Determine the cross-correlation $r_{xy}(l)$ and cross-PSD $R_{xy}(e^{j\omega})$ between the input and
output signals.


3.13 A WSS process with PSD $R_x(e^{j\omega}) = 1/(1.64 + 1.6\cos\omega)$ is applied to a causal system
described by the following difference equation

$$ y(n) = 0.6\,y(n-1) + x(n) + 1.25\,x(n-1) $$

Compute (a) the PSD of the output and (b) the cross-PSD $R_{xy}(e^{j\omega})$ between input and output.


3.14 Determine whether the following matrices are valid correlation matrices: 


1 i 1 
( R=], j (b) Re=|]5 1 5 
L 1 1 4 

4 2 
_ 1 1 4 
Reel" eu (d) Ry=|! A 
Cc = = = 2 5 
soee ee a a ee 


3.15 Consider a normal random vector $\mathbf{x}(\zeta)$ with components that are mutually uncorrelated, that is,
$\rho_{ij} = 0$. Show that (a) the covariance matrix $\boldsymbol{\Gamma}_x$ is diagonal and (b) the components of $\mathbf{x}(\zeta)$
are mutually independent.


3.16 Show that if a real, symmetric, and nonnegative definite matrix R has eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$,
then the matrix $\mathbf{R}^k$ has eigenvalues $\lambda_1^k, \lambda_2^k, \ldots, \lambda_M^k$.


3.17 Prove that the trace of R is given by

$$ \mathrm{tr}\,\mathbf{R} = \sum_{i=1}^{M} \lambda_i $$


3.18 Prove that the determinant of R is given by

$$ \det \mathbf{R} = |\mathbf{R}| = \prod_{i=1}^{M} \lambda_i = |\boldsymbol{\Lambda}| $$


3.19 Show that the determinants of R and $\boldsymbol{\Gamma}$ are related by

$$ \det \mathbf{R} = \det \boldsymbol{\Gamma}\, (1 + \boldsymbol{\mu}^H \boldsymbol{\Gamma}^{-1} \boldsymbol{\mu}) $$


3.20 Let $\mathbf{R}_x$ be the correlation matrix of the vector $\mathbf{x} = [x(0)\; x(2)\; x(3)]^T$, where x(n) is a zero-mean
WSS process.

(a) Check whether the matrix $\mathbf{R}_x$ is Hermitian, Toeplitz, and nonnegative definite.
(b) If we know the matrix $\mathbf{R}_x$, can we determine the correlation matrix of the vector $\bar{\mathbf{x}} =
[x(0)\; x(1)\; x(2)\; x(3)]^T$?


3.21 Using the nonnegativeness of $E\{[x(n+l) \pm x(n)]^2\}$, show that $r_x(0) \ge |r_x(l)|$ for all l.


3.22 Show that $r_x(l)$ is nonnegative definite, that is,

$$ \sum_{l=1}^{M} \sum_{k=1}^{M} a_l\, r_x(l-k)\, a_k^* \ge 0 \qquad \forall M,\; \forall a_1, \ldots, a_M $$


3.23 Let x(n) be a random process generated by the AP(1) system

$$ x(n) = a\,x(n-1) + w(n) \qquad n \ge 0 \qquad x(-1) = 0 $$

where w(n) is an IID$(0, \sigma_w^2)$ process.

(a) Determine the autocorrelation function $r_x(n_1, n_2)$.
(b) Show that $r_x(n_1, n_2)$ asymptotically approaches $r_x(n_1 - n_2)$, that is, it becomes
shift-invariant.


3.24 Let x be a random vector with mean $\boldsymbol{\mu}_x$ and autocorrelation $\mathbf{R}_x$.

(a) Show that $\mathbf{y} = \mathbf{Q}^H \mathbf{x}$ transforms x to an uncorrelated component vector y if Q is the
eigenmatrix of $\mathbf{R}_x$.
(b) Comment on the geometric interpretation of this transformation.
3.25 The mean and the covariance of a Gaussian random vector x are given by, respectively,


1 1 5 
py = 5 and Tx = L ; 


Plot the $1\sigma$, $2\sigma$, and $3\sigma$ concentration ellipses representing the contours of the density function
in the $(x_1, x_2)$ plane. Hints: The radius of an ellipse with major axis a (along $x_1$) and minor
axis b < a (along $x_2$) is given by

$$ r^2 = \frac{a^2 b^2}{a^2 \sin^2\theta + b^2 \cos^2\theta} $$

where $0 \le \theta \le 2\pi$. Compute the $1\sigma$ ellipse specified by $a = \sqrt{\lambda_1}$ and $b = \sqrt{\lambda_2}$ and then rotate
and translate each point $\mathbf{x}^{(i)} = [x_1^{(i)}\; x_2^{(i)}]^T$ using the transformation $\mathbf{w}^{(i)} = \mathbf{Q}_x \mathbf{x}^{(i)} + \boldsymbol{\mu}_x$.


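3.26 Consider the process $x(n) = a\,x(n-1) + w(n)$, where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$.


(a) Show that the M x M correlation matrix of the process is symmetric Toeplitz and is given
by

$$ \mathbf{R}_x = \frac{\sigma_w^2}{1 - a^2} \begin{bmatrix} 1 & a & \cdots & a^{M-1} \\ a & 1 & \cdots & a^{M-2} \\ \vdots & \vdots & \ddots & \vdots \\ a^{M-1} & a^{M-2} & \cdots & 1 \end{bmatrix} $$

(b) Verify that

$$ \mathbf{R}_x^{-1} = \frac{1}{\sigma_w^2} \begin{bmatrix} 1 & -a & 0 & \cdots & 0 \\ -a & 1+a^2 & -a & \cdots & 0 \\ 0 & -a & \ddots & \ddots & \vdots \\ \vdots & \vdots & \ddots & 1+a^2 & -a \\ 0 & 0 & \cdots & -a & 1 \end{bmatrix} $$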


(c) Show that if 


then L?R,Ly = (1 — a2) 

(d) For $\sigma_w^2 = 1$, a = 0.95, and M = 8 compute the DKLT and the DFT.

(e) Plot the eigenvalues of each transform in the same graph of the PSD of the process. Explain 
your findings. 

(f) Plot the eigenvectors of each transform and compare the results. 

(g) Repeat parts (e) and (f) for M = 16 and M = 32. Explain the obtained results. 

(h) Repeat parts (e) to (g) for a = 0.5 and compare with the results obtained for a = 0.95. 


3.27 Determine three different innovations representations of a zero-mean random vector x with 


correlation matrix 


3.28 Verify that the eigenvalues and eigenvectors of the M x M correlation matrix of the process
$x(n) = w(n) + b\,w(n-1)$, where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$, are given by $\lambda_k = R_x(e^{j\omega_k})$, $q_n^{(k)} =
\sin \omega_k n$, $\omega_k = \pi k/(M+1)$, where k = 1, 2, ..., M, (a) analytically and (b) numerically for
$\sigma_w^2 = 1$ and M = 8. Hint: Plot the eigenvalues on the same graph with the PSD.


3.29 Consider the process x(n) = w(n) + bw(n — 1). 


(a) Compute the DKLT for M = 3. 
(b) Show that the variances of the DKLT coefficients are $\sigma_w^2(1 + \sqrt{2}\,b)$, $\sigma_w^2$, and $\sigma_w^2(1 - \sqrt{2}\,b)$.


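3.30 Let x(n) be a stationary random process with mean $\mu_x$ and covariance $\gamma_x(l)$. Let $\hat{\mu}_x =
(1/N) \sum_{n=0}^{N-1} x(n)$ be the sample mean from the observations $\{x(n)\}_0^{N-1}$.


(a) Show that the variance of $\hat{\mu}_x$ is given by

$$ \mathrm{var}(\hat{\mu}_x) = N^{-1} \sum_{l=-N}^{N} \left(1 - \frac{|l|}{N}\right) \gamma_x(l) \le N^{-1} \sum_{l=-N}^{N} |\gamma_x(l)| \tag{P.3} $$

(b) Show that the above result (P.3) can be expressed as

$$ \mathrm{var}(\hat{\mu}_x) = \frac{\sigma_x^2}{N}\, [1 + \Delta_N(\rho_x)] \tag{P.4} $$

where

$$ \Delta_N(\rho_x) = 2 \sum_{l=1}^{N} \left(1 - \frac{l}{N}\right) \rho_x(l) \qquad \rho_x(l) = \frac{\gamma_x(l)}{\sigma_x^2} $$

(c) Show that (P.3) reduces to $\mathrm{var}(\hat{\mu}_x) = \sigma_x^2/N$ for a $\mathrm{WN}(\mu_x, \sigma_x^2)$ process.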


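3.31 Let x(n) be a stationary random process with mean $\mu_x$, variance $\sigma_x^2$, and covariance $\gamma_x(l)$. Let

$$ \hat{\sigma}_x^2 \triangleq \frac{1}{N} \sum_{n=0}^{N-1} [x(n) - \hat{\mu}_x]^2 $$

be the sample variance from the observations $\{x(n)\}_0^{N-1}$.


(a) Show that the mean of $\hat{\sigma}_x^2$ is given by

$$ E\{\hat{\sigma}_x^2\} = \sigma_x^2 - \mathrm{var}(\hat{\mu}_x) = \sigma_x^2 - \frac{1}{N} \sum_{l=-N}^{N} \left(1 - \frac{|l|}{N}\right) \gamma_x(l) $$

(b) Show that the above result reduces to $E\{\hat{\sigma}_x^2\} = (N-1)\sigma_x^2/N$ for a $\mathrm{WN}(\mu_x, \sigma_x^2)$ process.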


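3.32 The Cauchy distribution with mean $\mu$ is given by

$$ f_x(x) = \frac{1}{\pi}\, \frac{1}{1 + (x - \mu)^2} \qquad -\infty < x < \infty $$

Let $\{x_k(\zeta)\}_{k=1}^{N}$ be N IID random variables with the above distribution. Consider the mean
estimator based on $\{x_k(\zeta)\}_{k=1}^{N}$

$$ \hat{\mu}(\zeta) = \frac{1}{N} \sum_{k=1}^{N} x_k(\zeta) $$

Determine whether $\hat{\mu}(\zeta)$ is a consistent estimator of $\mu$.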
In this chapter we introduce and analyze the properties of a special class of stationary
random sequences that are obtained by driving a linear, time-invariant system with white 
noise. We focus on filters having a system function that is rational, that is, the ratio of two 
polynomials. The power spectral density of the resulting process is also rational, and its shape 
is completely determined by the filter coefficients. We will use the term pole-zero models 
when we want to emphasize the system viewpoint and the term autoregressive moving- 
average models to refer to the resulting random sequences. The latter term is not appropriate 
when the input is a harmonic process or a deterministic signal with a flat spectral envelope. 
We discuss the impulse response, autocorrelation, power spectrum, partial autocorrelation, 
and cepstrum of all-pole, all-zero, and pole-zero models. We express all these quantities 
in terms of the model coefficients and develop procedures to convert from one parameter 
set to another. Low-order models are studied in detail, because they are easy to analyze 
analytically and provide insight into the behavior and properties of higher-order models. An 
understanding of the correlation and spectral properties of a signal model is very important 
for the selection of the appropriate model in practical applications. Finally, we investigate a 
special case of pole-zero models with one or more unit poles. Pole-zero models are widely 
used for the modeling of stationary signals with short memory whereas models with unit 
poles are useful for the modeling of certain nonstationary processes with trends.


4.1 INTRODUCTION 


In Chapter 3 we defined and studied random processes as a mathematical tool to analyze 
random signals. In practice, we also need to generate random signals that possess certain 
known, second-order characteristics, or we need to describe observed signals in terms of 
the parameters of known random processes. 

The simplest random signal model is the wide-sense stationary white noise sequence 
$w(n) \sim \mathrm{WN}(0, \sigma_w^2)$, which has uncorrelated samples and a flat PSD. It is also easy to generate 
in practice by using simple algorithms. If we filter white noise with a stable LTI filter, 
we can obtain random signals with almost any arbitrary aperiodic correlation structure or 
continuous PSD. If we wish to generate a random signal with a line PSD using the previous 
approach, we need an LTI filter with “line” frequency response; that is, we need an oscillator. 
Unfortunately, such a system is not stable, and its output cannot be stationary. Fortunately, 
random signals with line PSDs can be easily generated by using the harmonic process model 
(linear combination of sinusoidal sequences with statistically independent random phases) 
discussed in Section 3.3.6. Figure 4.1 illustrates the filtering of white noise and “white” 
(flat spectral envelope) harmonic process by an LTI filter. Signal models with mixed PSDs 
can be obtained by combining the above two models, a process justified by a powerful result 
known as the Wold decomposition. 

FIGURE 4.1 
Signal models with continuous and discrete (line) power spectrum densities. 


When the LTI filter is specified by its impulse response, we have a nonparametric 
signal model because there is no restriction regarding the form of the model and the number 
of parameters is infinite. However, if we specify the filter by a finite-order rational system 
function, we have a parametric signal model described by a finite number of parameters. We 
focus on parametric models because they are simpler to deal with in practical applications. 
The two major topics we address in this chapter are (1) the derivation of the second-order 
moments of AP, AZ, and PZ models, given the coefficients of their system function, and 
(2) the design of an AP, AZ, or PZ system that produces a random signal with a given 
autocorrelation sequence or PSD function. The second problem is known as signal modeling 
and theoretically is equivalent to the spectral factorization procedure developed in Section 
2.4.4. The modeling of harmonic processes is theoretically straightforward and does not 
require the use of a linear filter to change the amplitude of the spectral lines. The challenging 
problem in this case is the identification of the filter by observing its response to a harmonic 
process with a flat PSD. The modeling problem for continuous PSDs has a solution, at least 
in principle, for every regular random sequence. 

In practical applications, the second-order moments of the signal to be modeled are 
not known a priori and have to be estimated from a set of signal observations. This el- 
ement introduces a new dimension and additional complications to the signal modeling 
problem, which are discussed in Chapter 9. In this chapter we primarily focus on paramet- 
ric models that replicate the second-order properties (autocorrelation or PSD) of stationary 
random sequences. If the sequence is Gaussian, the model provides a complete statistical 
characterization. The characterization of non-Gaussian processes, which requires the use 
of higher-order moments, is discussed in Chapter 12. 


4.1.1 Linear Nonparametric Signal Models 


Consider a stable LTI system with impulse response h(n) and input w(n). The output x(n) 
is given by the convolution summation 

$$x(n) = \sum_{k=-\infty}^{\infty} h(k)\, w(n-k) \tag{4.1.1}$$

which is known as a nonrecursive system representation because the output is computed 
by linearly weighting samples of the input signal. 


Linear random signal model. If the input w(n) is a zero-mean white noise process 
with variance $\sigma_w^2$, autocorrelation $r_w(l) = \sigma_w^2 \delta(l)$, and PSD $R_w(e^{j\omega}) = \sigma_w^2$, $-\pi < \omega \le \pi$, 
then from Table 3.2 the autocorrelation, complex PSD, and PSD of the output x(n) are given 
by, respectively, 

$$r_x(l) = \sigma_w^2 \sum_{k=-\infty}^{\infty} h(k)\, h^*(k-l) = \sigma_w^2\, r_h(l) \tag{4.1.2}$$

$$R_x(z) = \sigma_w^2\, H(z)\, H^*\!\left(\frac{1}{z^*}\right) \tag{4.1.3}$$

$$R_x(e^{j\omega}) = \sigma_w^2\, |H(e^{j\omega})|^2 = \sigma_w^2\, R_h(e^{j\omega}) \tag{4.1.4}$$


We notice that when the input is a white noise process, the shape of the autocorrelation 
and the power spectrum (second-order moments) of the output signal are completely char- 
acterized by the system. We use the term system-based signal model to refer to the signal 
generated by a system with a white noise input. If the system is linear, we use the term 
linear random signal model. In the statistical literature, the resulting model is known as 
the general linear process model. However, we should mention that in some applications 
it is more appropriate to use a deterministic input with flat spectral envelope or a “white” 
harmonic process input. 


Recursive representation. Suppose now that the inverse system $H_I(z) = 1/H(z)$ 
is causal and stable. If we assume, without any loss of generality, that h(0) = 1, then 
$h_I(n) = \mathcal{Z}^{-1}\{H_I(z)\}$ has $h_I(0) = 1$. Therefore the input w(n) can be obtained by 

$$w(n) = x(n) + \sum_{k=1}^{\infty} h_I(k)\, x(n-k) \tag{4.1.5}$$

Solving for x(n), we obtain the following recursive representation for the output signal 

$$x(n) = -\sum_{k=1}^{\infty} h_I(k)\, x(n-k) + w(n) \tag{4.1.6}$$

We use the term recursive representation to emphasize that the present value of the output 
is obtained by a linear combination of all past output values, plus the present value of the 
input. By construction the nonrecursive and recursive representations of system h(n) are 
equivalent; that is, they produce the same output when they are excited by the same input 
signal. 


Innovations representation. If the system H(z) is minimum-phase, then both h(n) 
and $h_I(n)$ are causal and stable. Hence, the output signal can be expressed nonrecursively 
by 

$$x(n) = \sum_{k=0}^{\infty} h(k)\, w(n-k) = \sum_{k=-\infty}^{n} h(n-k)\, w(k) \tag{4.1.7}$$

or recursively by (4.1.6). 
From (4.1.7) we obtain 

$$x(n+1) = \sum_{k=-\infty}^{n} h(n+1-k)\, w(k) + w(n+1)$$

or by using (4.1.5) 

$$x(n+1) = \underbrace{-\sum_{k=-\infty}^{n} h_I(n+1-k)\, x(k)}_{\text{past information}} + \underbrace{w(n+1)}_{\text{new information}} \tag{4.1.8}$$

where the past information is a linear combination of x(n), x(n − 1), . . . . 
Careful inspection of (4.1.8) indicates that if the system generating x(n) is minimum-phase, 
the sample w(n + 1) brings all the new information (innovation) to be carried by the sample 
x(n + 1). All other information can be predicted from the past samples x(n), x(n − 1), . . . 
of the signal (see Section 6.6). We stress that this interpretation holds only if H(z) is 
minimum-phase. 

The system H(z) generates the signal x(n) by introducing dependence in the white noise 
input w(n) and is known as the synthesis or coloring filter. In contrast, the inverse system 
$H_I(z)$ can be used to recover the input w(n) and is known as the analysis or whitening filter. 
In this sense the innovations sequence and the output process are completely equivalent. 
The synthesis and analysis filters are shown in Figure 4.2. 
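
The coloring and whitening roles of a minimum-phase system and its inverse can be checked numerically. The following is a minimal sketch, assuming an illustrative first-order system H(z) = (1 + 0.5 z^{-1})/(1 − 0.8 z^{-1}) that is not taken from the text; the helper name `difference_filter` is hypothetical, and zero initial conditions are assumed for both filters.

```python
import numpy as np

def difference_filter(b, a, u):
    """Run y(n) = sum_k b[k] u(n-k) - sum_{k>=1} a[k] y(n-k), with a[0] == 1
    and zero initial conditions (hypothetical helper for this sketch)."""
    y = np.zeros(len(u))
    for n in range(len(u)):
        acc = 0.0
        for k in range(len(b)):
            if n - k >= 0:
                acc += b[k] * u[n - k]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

rng = np.random.default_rng(0)
w = rng.standard_normal(500)           # white noise input w(n)

d = [1.0, 0.5]                         # numerator of H(z): zero at z = -0.5
a = [1.0, -0.8]                        # denominator of H(z): pole at z = 0.8

x = difference_filter(d, a, w)         # synthesis (coloring) filter H(z)
w_rec = difference_filter(a, d, x)     # analysis (whitening) filter 1/H(z)

max_err = np.max(np.abs(w - w_rec))    # should be at rounding-error level
```

Because both the pole and the zero lie inside the unit circle, the inverse filter is also causal and stable, so the recovered sequence `w_rec` matches `w` up to rounding error.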


FIGURE 4.2 
Synthesis and analysis filters used in innovations representation. 
Synthesis or coloring filter: $w(n) \sim \mathrm{IID}(0, \sigma_w^2)$ → H(z) → x(n). 
Analysis or whitening filter: x(n) → $H_I(z) = 1/H(z)$ → w(n). 


Spectral factorization 
Most random processes with a continuous PSD $R_x(e^{j\omega})$ can be generated by exciting 
a minimum-phase system $H_{\min}(z)$ with white noise. The PSD of the resulting process is 
given by 

$$R_x(e^{j\omega}) = \sigma_w^2\, |H_{\min}(e^{j\omega})|^2 \tag{4.1.9}$$

The process of obtaining $H_{\min}(z)$ from $R_x(e^{j\omega})$ or $r_x(l)$ is known as spectral factorization. 
If the PSD $R_x(e^{j\omega})$ satisfies the Paley-Wiener condition 

$$\int_{-\pi}^{\pi} |\ln R_x(e^{j\omega})|\, d\omega < \infty \tag{4.1.10}$$

then the process x(n) is called regular and its complex PSD can be factored as follows (see 
Section 2.4.4) 

$$R_x(z) = \sigma_w^2\, H_{\min}(z)\, H_{\min}^*\!\left(\frac{1}{z^*}\right) \tag{4.1.11}$$

where 

$$\sigma_w^2 = \exp\left\{ \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln[R_x(e^{j\omega})]\, d\omega \right\} \tag{4.1.12}$$



is the variance of the white noise input and can be interpreted as the geometric mean of 
$R_x(e^{j\omega})$. Consider the inverse Fourier transform of $\ln R_x(e^{j\omega})$: 

$$c(k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln[R_x(e^{j\omega})]\, e^{jk\omega}\, d\omega \tag{4.1.13}$$

which is a sequence known as the cepstrum of $r_x(l)$. Note that $c(0) = \ln \sigma_w^2$. Thus in the 
cepstral domain, the multiplicative factors $H_{\min}(z)$ and $H_{\min}^*(1/z^*)$ are now additively 
separable due to the natural logarithm of $R_x(e^{j\omega})$. Define 

$$c_+(k) \triangleq \frac{c(0)}{2}\,\delta(k) + c(k)\, u(k-1) \tag{4.1.14}$$

and 

$$c_-(k) \triangleq \frac{c(0)}{2}\,\delta(k) + c(k)\, u(-k-1) \tag{4.1.15}$$

as the positive- and negative-axis projections of c(k), respectively, with c(0) distributed 
equally between them. Then we obtain 

$$h_{\min}(n) = \mathcal{F}^{-1}\{\exp \mathcal{F}[c_+(k)]\} \tag{4.1.16}$$

as the impulse response of the minimum-phase system $H_{\min}(z)$. Similarly, 

$$h_{\max}(n) = \mathcal{F}^{-1}\{\exp \mathcal{F}[c_-(k)]\} \tag{4.1.17}$$

is the corresponding maximum-phase system. This completes the spectral factorization 
procedure for an arbitrary PSD $R_x(e^{j\omega})$, which, in general, is a complicated task. However, 
it is straightforward if $R_x(z)$ is a rational function, as we discussed in Section 2.4.2. 
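
The cepstral route to the minimum-phase factor can be illustrated on a sampled PSD. The following is a minimal sketch, assuming the integrals are approximated by an N-point DFT and using an illustrative PSD built from a known factor (1 + 0.5 z^{-1}) with white noise variance 2; the grid size and the test PSD are assumptions, not values from the text. With half of c(0) assigned to the causal projection, the gain $\sigma_w$ ends up inside the computed response.

```python
import numpy as np

N = 1024                                   # DFT grid size (assumption)
omega = 2 * np.pi * np.arange(N) / N
sigma_w2 = 2.0                             # illustrative white noise variance
h_true = np.array([1.0, 0.5])              # minimum-phase: zero at z = -0.5

# PSD sigma_w^2 |H(e^{jw})|^2 sampled on the grid
H_true = h_true[0] + h_true[1] * np.exp(-1j * omega)
R = sigma_w2 * np.abs(H_true) ** 2

# Cepstrum c(k): inverse DFT of ln R
c = np.fft.ifft(np.log(R)).real

# Causal projection: half of c(0), plus c(k) for k >= 1
c_plus = np.zeros(N)
c_plus[0] = c[0] / 2
c_plus[1:N // 2] = c[1:N // 2]

# exp of the transform of c_plus, then back to the time domain
h_min = np.fft.ifft(np.exp(np.fft.fft(c_plus))).real

sigma_w2_est = np.exp(c[0])                # geometric mean of the PSD
```

In this example `h_min` starts with approximately sqrt(2) and sqrt(2)/2, that is, $\sigma_w$ times the original factor, and `sigma_w2_est` recovers the variance 2 as the geometric mean of the PSD.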


Spectral flatness measure 


The spectral flatness measure (SFM) of a zero-mean process with PSD $R_x(e^{j\omega})$ is 
defined by (Makhoul 1975) 

$$\mathrm{SFM}_x \triangleq \frac{\exp\left\{ \dfrac{1}{2\pi} \displaystyle\int_{-\pi}^{\pi} \ln[R_x(e^{j\omega})]\, d\omega \right\}}{\dfrac{1}{2\pi} \displaystyle\int_{-\pi}^{\pi} R_x(e^{j\omega})\, d\omega} = \frac{\sigma_w^2}{\sigma_x^2} \tag{4.1.18}$$

where the second equality follows from (4.1.12). It describes the shape (or more appropriately, 
flatness) of the PSD by a single number. If x(n) is a white noise process, then 
$R_x(e^{j\omega}) = \sigma_x^2$ and $\mathrm{SFM}_x = 1$. More specifically, we can show that 

$$0 \le \mathrm{SFM}_x \le 1 \tag{4.1.19}$$

Observe that the numerator of (4.1.18) is the geometric mean while the denominator is the 
arithmetic mean of a real-valued, nonnegative continuous waveform $R_x(e^{j\omega})$. Since x(n) 
is a regular process satisfying (4.1.10), these means are always positive. Furthermore, by the 
inequality between the arithmetic and geometric means, their ratio is never greater than unity 
and is equal to unity if the waveform is 
constant. This, then, proves (4.1.19). A detailed proof is given in Jayant and Noll (1984). 
When x(n) is obtained by filtering the zero-mean white noise process w(n) through 
the filter H(z), then the coloring of $R_x(e^{j\omega})$ is due to H(z). In this case, 
$R_x(e^{j\omega}) = \sigma_w^2 |H(e^{j\omega})|^2$ from (4.1.9), and we obtain 

$$\mathrm{SFM}_x = \frac{\sigma_w^2}{\dfrac{1}{2\pi} \displaystyle\int_{-\pi}^{\pi} \sigma_w^2\, |H(e^{j\omega})|^2\, d\omega} = \frac{1}{\dfrac{1}{2\pi} \displaystyle\int_{-\pi}^{\pi} |H(e^{j\omega})|^2\, d\omega} \tag{4.1.20}$$

Thus $\mathrm{SFM}_x$ is the inverse of the filter power (or power transfer factor) if h(0) is normalized 
to unity. 
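
The ratio of geometric to arithmetic mean is easy to evaluate on a uniform frequency grid. The following is a minimal sketch, assuming the integrals are approximated by averages over N grid points; the function name `spectral_flatness` and the single-pole filter with pole at 0.8 are illustrative assumptions. For this filter h(0) = 1 and the filter power is 1/(1 − 0.8²), so the SFM should be close to 1 − 0.8² = 0.36.

```python
import numpy as np

def spectral_flatness(R):
    """SFM of a PSD sampled uniformly over one period:
    geometric mean divided by arithmetic mean (hypothetical helper)."""
    R = np.asarray(R, dtype=float)
    return np.exp(np.mean(np.log(R))) / np.mean(R)

N = 4096
omega = 2 * np.pi * np.arange(N) / N
a = 0.8                                        # illustrative pole location
H = 1.0 / (1.0 - a * np.exp(-1j * omega))      # H(z) = 1/(1 - a z^-1), h(0) = 1

sfm_ar1 = spectral_flatness(np.abs(H) ** 2)    # colored process
sfm_white = spectral_flatness(np.ones(N))      # flat PSD (white noise)
```

Moving the pole closer to the unit circle makes the PSD more peaked and drives the SFM toward zero, whereas the flat PSD gives exactly one.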


4.1.2 Parametric Pole-Zero Signal Models 


Parametric models describe a system with a finite number of parameters. The major subject 
of this chapter is the treatment of parametric models that have rational system functions. To 
this end, consider a system described by the following linear constant-coefficient difference 
equation 


$$x(n) + \sum_{k=1}^{P} a_k\, x(n-k) = \sum_{k=0}^{Q} d_k\, w(n-k) \tag{4.1.21}$$

where w(n) and x(n) are the input and output signals, respectively. Taking the z-transform 
of both sides, we find that the system function is 

$$H(z) = \frac{X(z)}{W(z)} = \frac{\sum_{k=0}^{Q} d_k z^{-k}}{1 + \sum_{k=1}^{P} a_k z^{-k}} \triangleq \frac{D(z)}{A(z)} \tag{4.1.22}$$

We can express H(z) in terms of the poles and zeros of the system as follows: 

$$H(z) = d_0\, \frac{\prod_{k=1}^{Q} (1 - z_k z^{-1})}{\prod_{k=1}^{P} (1 - p_k z^{-1})} \tag{4.1.23}$$

The system has Q zeros $\{z_k\}$ and P poles $\{p_k\}$ (zeros and poles at z = 0 are not considered 
here). The term $d_0$ is the system gain. For the rest of the book, we assume that the polynomials 
D(z) and A(z) do not have any common roots, that is, common poles and zeros have already 
been canceled. 


Types of pole-zero models 


There are three cases of interest: 


For P > 0 and Q > 0, we have a pole-zero model, denoted by PZ(P, Q). If the model 
is assumed to be causal, its output is given by 

$$x(n) = -\sum_{k=1}^{P} a_k\, x(n-k) + \sum_{k=0}^{Q} d_k\, w(n-k) \tag{4.1.24}$$

For P = 0, we have an all-zero model, denoted by AZ(Q). The input-output difference 
equation is 

$$x(n) = \sum_{k=0}^{Q} d_k\, w(n-k) \tag{4.1.25}$$

For Q = 0, we have an all-pole model, denoted by AP(P). The input-output difference 
equation is 

$$x(n) = -\sum_{k=1}^{P} a_k\, x(n-k) + d_0\, w(n) \tag{4.1.26}$$

If we excite a parametric model with white noise, we obtain a signal whose second-order 
moments are determined by the parameters of the model. Indeed, from Sections 3.4.2 
and 3.4.3, we recall that if $w(n) \sim \mathrm{IID}(0, \sigma_w^2)$ with finite variance,† then 

$$r_x(l) = \sigma_w^2\, r_h(l) = \sigma_w^2\, h(l) * h^*(-l) \tag{4.1.27}$$

$$R_x(z) = \sigma_w^2\, R_h(z) = \sigma_w^2\, H(z)\, H^*\!\left(\frac{1}{z^*}\right) \tag{4.1.28}$$

$$R_x(e^{j\omega}) = \sigma_w^2\, R_h(e^{j\omega}) = \sigma_w^2\, |H(e^{j\omega})|^2 \tag{4.1.29}$$


Such signal models are of great practical interest and have special names in the statistical 
literature: 


• The AZ(Q) is known as the moving-average model, denoted by MA(Q). 

• The AP(P) is known as the autoregressive model, denoted by AR(P). 

• The PZ(P, Q) is known as the autoregressive moving-average model, denoted by 
ARMA(P, Q). 


We specify a parametric signal model by normalizing $d_0 = 1$ and setting the variance of 
the input to $\sigma_w^2$. The defining set of model parameters is given by 
$\{a_1, a_2, \ldots, a_P, d_1, \ldots, d_Q, \sigma_w^2\}$ (see Figure 4.3). An alternative is to set $\sigma_w^2 = 1$ and leave $d_0$ arbitrary. We stress 
that these models assume the resulting processes are stationary, which is ensured if the 
corresponding systems are BIBO stable. 


† The case of infinite variance is discussed in Chapter 12. 


FIGURE 4.3 
Block diagram representation of a parametric, rational signal model, with input w(n) and output x(n). 


Short-memory behavior 


To find the memory behavior of pole-zero models, we investigate the nature of their 
impulse response. To this end, we recall that for Q ≥ P, (4.1.23) can be expanded as 

$$H(z) = \sum_{j=0}^{Q-P} B_j z^{-j} + \sum_{k=1}^{P} \frac{A_k}{1 - p_k z^{-1}} \tag{4.1.30}$$

where for simplicity we assume that the model has P distinct poles. The first term in (4.1.30) 
disappears if P > Q. The coefficients $B_j$ can be obtained by long division, and the coefficients 
$A_k$ are given by 

$$A_k = (1 - p_k z^{-1})\, H(z) \big|_{z = p_k} \tag{4.1.31}$$

If the model is causal, taking the inverse z-transform results in an impulse response that is a 
linear combination of impulses, real exponentials, and damped sinusoids (produced by the 
combination of complex exponentials) 

$$h(n) = \sum_{j=0}^{Q-P} B_j\, \delta(n-j) + \sum_{k=1}^{P_1} A_k (p_k)^n u(n) + \sum_{i=1}^{P_2} C_i\, r_i^n \cos(\omega_i n + \theta_i)\, u(n) \tag{4.1.32}$$

where $p_i = r_i e^{\pm j\omega_i}$ and $P = P_1 + 2P_2$. Recall that u(n) and δ(n) are the unit step and 
unit impulse functions, respectively. We note that the memory of any all-pole model decays 
exponentially with time and that the rate of decay is controlled by the pole closest to the 
unit circle. The contribution of multiple poles at the same location is treated in Problem 4.1. 
Careful inspection of (4.1.32) leads to the following conclusions: 


1. For AZ(Q) models, the impulse response has finite duration and, therefore, can have any 
shape. 

2. The impulse response of causal AP(P) and PZ(P, Q) models with single poles consists 
of a linear combination of damped real exponentials (produced by the real poles) and 
exponentially damped sinusoids (produced by complex conjugate poles). The rate of 
decay decreases as the poles move closer to the unit circle and is determined by the pole 
closest to the unit circle. 

3. The model is stable if and only if h(n) is absolutely summable, which, due to (4.1.32), 
is equivalent to $|p_k| < 1$ for all k. In other words, a causal pole-zero model is BIBO 
stable if and only if all the poles are inside the unit circle.† 


We conclude that causal, stable PZ(P, Q) models with P > 0 have an exponentially 
fading memory because their impulse response decays exponentially with time. Therefore, 
the autocorrelation $r_h(l) = h(l) * h^*(-l)$ also decays exponentially (see Example 4.2.2), 
and pole-zero models have short memory according to the definition given in Section 3.4.3. 
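
The stability condition and the exponential decay of the impulse response can be checked numerically. The following is a minimal sketch, assuming an illustrative AP(2) model with denominator coefficients {1, −1.5, 0.81} (a complex-conjugate pole pair of radius 0.9) and unit gain; the function name `poles_and_stability` is hypothetical.

```python
import numpy as np

def poles_and_stability(a):
    """a = [1, a_1, ..., a_P]: coefficients of A(z) in powers of z^{-1}.
    Returns the poles and whether all lie strictly inside the unit circle."""
    poles = np.roots(a)                 # roots of z^P + a_1 z^{P-1} + ... + a_P
    return poles, bool(np.all(np.abs(poles) < 1.0))

a = [1.0, -1.5, 0.81]                   # illustrative coefficients (assumption)
poles, stable = poles_and_stability(a)
r_max = np.max(np.abs(poles))           # radius of the pole closest to the unit circle

# Impulse response of the all-pole model with unit gain, from the difference equation
n_samples = 200
h = np.zeros(n_samples)
for n in range(n_samples):
    h[n] = 1.0 if n == 0 else 0.0
    for k in range(1, len(a)):
        if n - k >= 0:
            h[n] -= a[k] * h[n - k]

# If h(n) decays like r_max**n, this ratio stays bounded for all n
envelope_ratio = np.abs(h) / r_max ** np.arange(n_samples)
```

For this pole pair the ratio stays below a fixed constant, which is the numerical counterpart of the statement that the rate of decay is set by the pole closest to the unit circle.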


Generation of random signals with rational power spectra 


Sample realizations of random sequences with rational power spectra can be easily 
generated by using the difference equation (4.1.24) and a random number generator. In 
most applications, we use a Gaussian excitation because the generated sequence will also 
be Gaussian. For non-Gaussian inputs, it is difficult to predict the type of distribution 
of the output signal. If, on one hand, we specify the frequency response of the model, 
the coefficients of the difference equation can be obtained by using a digital filter design 
package. If, on the other hand, the power spectrum or the autocorrelation is given, the 
coefficients of the model are determined via spectral factorization. If we wish to avoid 
the transient effects that make some of the initial output samples nonstationary, we should 
consider the response of the model only after the initial transients have died out. 

† Poles on the unit circle are discussed in Section 4.5. 
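
The generation procedure described above can be sketched directly from the pole-zero difference equation. This is a minimal illustration, assuming a Gaussian excitation from NumPy's random generator and a fixed number of discarded start-up samples; the function name `generate_arma`, the burn-in length of 500, and the AR(1) test case are assumptions chosen for illustration only.

```python
import numpy as np

def generate_arma(a, d, n_samples, sigma_w=1.0, burn_in=500, seed=None):
    """Generate x(n) = -sum_{k=1}^{P} a[k] x(n-k) + sum_{k=0}^{Q} d[k] w(n-k).

    a = [1, a_1, ..., a_P], d = [d_0, ..., d_Q]. The first `burn_in` output
    samples are discarded so that start-up transients have died out.
    """
    rng = np.random.default_rng(seed)
    total = n_samples + burn_in
    w = sigma_w * rng.standard_normal(total)      # Gaussian white noise
    x = np.zeros(total)
    for n in range(total):
        acc = 0.0
        for k in range(len(d)):
            if n - k >= 0:
                acc += d[k] * w[n - k]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * x[n - k]
        x[n] = acc
    return x[burn_in:]

# AR(1) test case: x(n) = 0.8 x(n-1) + w(n), theoretical variance 1/(1 - 0.8**2)
x = generate_arma(a=[1.0, -0.8], d=[1.0], n_samples=50_000, seed=1)
var_theory = 1.0 / (1.0 - 0.8 ** 2)
var_sample = np.var(x)
```

The sample variance should be close to the theoretical value, with the agreement improving as the record length grows. A suitable burn-in length depends on how close the poles are to the unit circle.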


4.1.3 Mixed Processes and Wold Decomposition 


An arbitrary stationary random process can be constructed to possess a continuous PSD 
$R_x(e^{j\omega})$ and a discrete power spectrum $R_x(k)$. Such processes are called mixed processes 
because the continuous PSD is due to regular processes while the discrete spectrum is due to 
harmonic (or almost periodic) processes. A further interpretation of mixed processes is that 
the first part is an unpredictable process while the second part is a predictable process (in the 
sense that past samples can be used to exactly determine future samples). This interpretation 
is due to the Wold decomposition theorem. 


THEOREM 4.1 (WOLD DECOMPOSITION). A general stationary random process can be 
written as a sum 

$$x(n) = x_r(n) + x_p(n) \tag{4.1.33}$$

where $x_r(n)$ is a regular process possessing a continuous spectrum and $x_p(n)$ is a predictable 
process possessing a discrete spectrum. Furthermore, $x_r(n)$ is orthogonal to $x_p(n)$; that is, 

$$E\{x_r(n_1)\, x_p^*(n_2)\} = 0 \qquad \text{for all } n_1, n_2 \tag{4.1.34}$$

The proof of this theorem is very involved, but a good approach to it is given in Therrien 
(1992). Using (4.1.34), the correlation sequence of x(n) in (4.1.33) is given by 

$$r_x(l) = r_{x_r}(l) + r_{x_p}(l)$$

from which we obtain the continuous and discrete spectra. As discussed above, the regular 
process has an innovations representation w(n) that is uncorrelated but not necessarily independent. 
For example, w(n) can be the output of an all-pass filter driven by an IID sequence. To 
determine if this is the case, we need to use higher-order moments (see Section 12.1). 


4.2 ALL-POLE MODELS 


We start our discussion of linear signal models with all-pole models because they are the 
easiest to analyze and the most often used in practical applications. We assume an all-pole 
model of the form 


$$H(z) = \frac{d_0}{A(z)} = \frac{d_0}{1 + \sum_{k=1}^{P} a_k z^{-k}} = \frac{d_0}{\prod_{k=1}^{P} (1 - p_k z^{-1})} \tag{4.2.1}$$

where $d_0$ is the system gain and P is the order of the model. The all-pole model can be 
implemented using either a direct or a lattice structure. The conversion between the two 
sets of parameters can be done by using the step-up and step-down recursions described in 
Section 2.5. 


4.2.1 Model Properties 


In this section, we derive analytic expressions for various properties of the all-pole model, 
namely, the impulse response, the autocorrelation, and the spectrum. We determine the 
system-related properties $r_h(l)$ and $R_h(e^{j\omega})$ because the results can be readily applied to 
obtain the signal model properties for inputs with both continuous and discrete spectra. 


Impulse response. The impulse response h(n) can be specified by first rewriting 
(4.2.1) as 

$$H(z) + \sum_{k=1}^{P} a_k\, H(z)\, z^{-k} = d_0$$

and then taking the inverse z-transform to obtain 

$$h(n) + \sum_{k=1}^{P} a_k\, h(n-k) = d_0\, \delta(n) \tag{4.2.2}$$

If the system is causal, then 

$$h(n) = -\sum_{k=1}^{P} a_k\, h(n-k) + d_0\, \delta(n) \tag{4.2.3}$$

If H(z) has all its poles inside the unit circle, then h(n) is a causal, stable sequence and the 
system is minimum-phase. From (4.2.3) we have 

$$h(0) = d_0 \tag{4.2.4}$$

$$h(n) = -\sum_{k=1}^{P} a_k\, h(n-k) \qquad n > 0 \tag{4.2.5}$$

and owing to causality we have 

$$h(n) = 0 \qquad n < 0 \tag{4.2.6}$$

Thus, except for the value at n = 0, h(n) can be obtained recursively as a linearly weighted 
summation of its previous values h(n − 1), . . . , h(n − P). One can say that h(n) can be 
predicted (with zero error for n ≠ 0) from the past P values. Thus, the coefficients $\{a_k\}$ 
are often referred to as predictor coefficients. Note that there is a close relationship between 
all-pole models and linear prediction that will be discussed in Section 4.2.2. 

From (4.2.4) and (4.2.5), we can also write the inverse relation 

$$a_n = -\left[ \frac{h(n)}{h(0)} + \sum_{k=1}^{n-1} a_k\, \frac{h(n-k)}{h(0)} \right] \qquad 1 \le n \le P \tag{4.2.7}$$

with $a_0 = 1$. From (4.2.7) and (4.2.4), we conclude that if we are given the first P + 1 
values of the impulse response h(n), 0 ≤ n ≤ P, then the parameters of the all-pole filter 
are completely specified. 
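
The forward recursion (4.2.4) to (4.2.6) and its inverse (4.2.7) can be exercised together as a round trip. The following is a minimal sketch, assuming real coefficients and an illustrative AP(3) coefficient set with gain 2 that is not taken from the text; the function names are hypothetical.

```python
import numpy as np

def ap_impulse_response(a, d0, n_samples):
    """h(n) of a causal all-pole model; a = [1, a_1, ..., a_P]."""
    P = len(a) - 1
    h = np.zeros(n_samples)
    h[0] = d0                                   # h(0) = d0
    for n in range(1, n_samples):
        for k in range(1, min(n, P) + 1):       # h(n - k) = 0 for k > n (causality)
            h[n] -= a[k] * h[n - k]
    return h

def ap_coefficients_from_h(h, P):
    """Recover d0 and a_1..a_P from h(0), ..., h(P) by inverting the recursion."""
    a = np.zeros(P + 1)
    a[0] = 1.0
    for n in range(1, P + 1):
        # the k = 0 term (a_0 = 1) contributes h(n)/h(0)
        a[n] = -sum(a[k] * h[n - k] for k in range(n)) / h[0]
    return h[0], a

a_true = np.array([1.0, -0.9, 0.2, 0.1])        # illustrative coefficients (assumption)
h = ap_impulse_response(a_true, d0=2.0, n_samples=10)
d0_rec, a_rec = ap_coefficients_from_h(h, P=3)
```

Only the first P + 1 samples of `h` are used in the inversion, in line with the statement that these samples completely specify the all-pole filter.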

Finally, we note that a causal H(z) can be written as a one-sided, infinite polynomial 
$H(z) = \sum_{n=0}^{\infty} h(n) z^{-n}$. This representation of H(z) implies that any finite-order, all-pole 
model can be represented equivalently by an infinite number of zeros. In general, a single 
pole can be represented by an infinite number of zeros, and conversely a single zero can be 
represented by an infinite number of poles. If the poles are inside the unit circle, so are the 
corresponding zeros, and vice versa. 


EXAMPLE 4.2.1. A single pole at z = a can be represented by 

$$H(z) = \frac{1}{1 - a z^{-1}} = \sum_{n=0}^{\infty} a^n z^{-n} \qquad |a| < 1 \tag{4.2.8}$$

The question is, where are the infinite number of zeros located? To find the answer, let us consider 
the finite polynomial 

$$H_N(z) = \sum_{n=0}^{N} a^n z^{-n} \tag{4.2.9}$$

where we have truncated H(z) at n = N. Thus $H_N(z)$ is a geometric series that can be written 
in closed form as 

$$H_N(z) = \frac{1 - a^{N+1} z^{-(N+1)}}{1 - a z^{-1}} \tag{4.2.10}$$

And $H_N(z)$ has a single pole at z = a and N + 1 zeros at 

$$z_i = a\, e^{j 2\pi i/(N+1)} \qquad i = 0, 1, \ldots, N \tag{4.2.11}$$

The N + 1 zeros are equally distributed on the circle |z| = a with one of the zeros (for i = 0) 
located at z = a. But the zero at z = a cancels the pole at the same location. Therefore, $H_N(z)$ 
has the remaining N zeros: 

$$z_i = a\, e^{j 2\pi i/(N+1)} \qquad i = 1, 2, \ldots, N \tag{4.2.12}$$

The transfer function H(z) of the single-pole model is obtained from $H_N(z)$ by letting N go 
to infinity. In the limit, $H_\infty(z)$ has an infinite number of zeros equally distributed on the circle 
|z| = a; the zeros are everywhere on that circle except at the point z = a. Similarly, the 
denominator from (4.2.8), a polynomial with a single zero at z = a, can be written as 

$$A(z) = 1 - a z^{-1} = \frac{1}{H(z)} = \frac{1}{1 + \sum_{n=1}^{\infty} a^n z^{-n}} \qquad |a| < 1 \tag{4.2.13}$$

that is, a single zero can also be represented by an infinite number of poles. In this case, the 
poles are equally distributed on a circle that passes through the location of the zero; the poles 
are everywhere on the circle except at the actual location of the zero. 
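
The location of the zeros of the truncated series can be confirmed numerically. The following is a minimal sketch, assuming the illustrative values a = 0.9 and N = 8 (not taken from the text) and using NumPy's polynomial root finder on the numerator of $H_N(z)$ written in powers of z.

```python
import numpy as np

a = 0.9                                   # illustrative pole location (assumption)
N = 8                                     # illustrative truncation order (assumption)

# H_N(z) = sum_{n=0}^{N} a^n z^{-n} = z^{-N} (z^N + a z^{N-1} + ... + a^N)
coeffs = a ** np.arange(N + 1)            # descending powers of z
zeros = np.roots(coeffs)

# Predicted zeros: a * exp(j 2 pi i / (N + 1)), i = 1, ..., N
predicted = a * np.exp(2j * np.pi * np.arange(1, N + 1) / (N + 1))

radii = np.abs(zeros)
# distance from each computed zero to the nearest predicted zero
dist = np.min(np.abs(zeros[:, None] - predicted[None, :]), axis=1)
```

All computed zeros lie on the circle of radius a and none falls at z = a, where the pole of the closed form sits.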


Autocorrelation. The impulse response h(n) of an all-pole model has infinite dura- 
tion so that its autocorrelation involves an infinite summation, which is not practical to 
write in closed form except for low-order models. However, the autocorrelation function 
obeys a recursive relation that relates the autocorrelation values to the model parameters. 
Multiplying (4.2.2) by $h^*(n-l)$ and summing over all n, we have 

$$\sum_{n=-\infty}^{\infty} \sum_{k=0}^{P} a_k\, h(n-k)\, h^*(n-l) = d_0 \sum_{n=-\infty}^{\infty} h^*(n-l)\, \delta(n) \tag{4.2.14}$$

where $a_0 = 1$. Interchanging the order of summations in the left-hand side, we obtain 

$$\sum_{k=0}^{P} a_k\, r_h(l-k) = d_0\, h^*(-l) \qquad -\infty < l < \infty \tag{4.2.15}$$

where $r_h(l)$ is the autocorrelation of h(n). Equation (4.2.15) is true for all l, but because 
h(l) = 0 for l < 0, h(−l) = 0 for l > 0, and we have 

$$\sum_{k=0}^{P} a_k\, r_h(l-k) = 0 \qquad l > 0 \tag{4.2.16}$$

From (4.2.4) and (4.2.15), we also have for l = 0, 

$$\sum_{k=0}^{P} a_k\, r_h(-k) = |d_0|^2 \tag{4.2.17}$$

where we used the fact that $r_h^*(-l) = r_h(l)$. Equation (4.2.16) can be rewritten as 

$$r_h(l) = -\sum_{k=1}^{P} a_k\, r_h(l-k) \qquad l > 0 \tag{4.2.18}$$

which is a recursive relation for $r_h(l)$ in terms of past values of the autocorrelation and $\{a_k\}$. 
Relation (4.2.18) for $r_h(l)$ is similar to relation (4.2.5) for h(n), but with one important 
difference: (4.2.5) for h(n) is true for all n ≠ 0 while (4.2.18) for $r_h(l)$ is true only if l > 0; 
for l < 0, $r_h(l)$ obeys (4.2.15). 

If we define the normalized autocorrelation coefficients as 

$$\rho_h(l) = \frac{r_h(l)}{r_h(0)} \tag{4.2.19}$$

then we can divide (4.2.17) by $r_h(0)$ and deduce the following relation for $r_h(0)$ 

$$r_h(0) = \frac{|d_0|^2}{1 + \sum_{k=1}^{P} a_k\, \rho_h(k)} \tag{4.2.20}$$

which is the energy of the output of the all-pole filter when excited by a single impulse. 


Autocorrelation in terms of poles. The complex spectrum of the AP(P) model is 

$$R_h(z) = H(z)\, H^*\!\left(\frac{1}{z^*}\right) = |d_0|^2 \prod_{k=1}^{P} \frac{1}{(1 - p_k z^{-1})(1 - p_k^* z)} \tag{4.2.21}$$

Therefore, the autocorrelation sequence can be expressed in terms of the poles by taking the 
inverse z-transform of $R_h(z)$, that is, $r_h(l) = \mathcal{Z}^{-1}\{R_h(z)\}$. The poles $p_k$ of the 
minimum-phase model H(z) contribute causal terms in the partial fraction expansion, whereas the 
poles $1/p_k^*$ of the nonminimum-phase model $H^*(1/z^*)$ contribute noncausal terms. This is 
best illustrated with the following example. 


EXAMPLE 4.2.2. Consider the following minimum-phase AP(1) model 

$$H(z) = \frac{1}{1 + a z^{-1}} \qquad -1 < a < 1 \tag{4.2.22}$$

Owing to causality, the ROC of H(z) is |z| > |a|. The z-transform 

$$H(z^{-1}) = \frac{1}{1 + a z} \qquad -1 < a < 1 \tag{4.2.23}$$

corresponds to the noncausal sequence $h(-n) = (-a)^{-n} u(-n)$, and its ROC is |z| < 1/|a|. 
Hence, 

$$R_h(z) = H(z)\, H(z^{-1}) = \frac{1}{(1 + a z^{-1})(1 + a z)} \tag{4.2.24}$$

which corresponds to a two-sided sequence because its ROC, |a| < |z| < 1/|a|, is a ring in the 
z-plane. Using partial fraction expansion, we obtain 

$$R_h(z) = \frac{1}{1 - a^2}\, \frac{-a z^{-1}}{1 + a z^{-1}} + \frac{1}{1 - a^2}\, \frac{1}{1 + a z} \tag{4.2.25}$$

The pole p = −a corresponds to the causal sequence $[1/(1 - a^2)](-a)^{l} u(l - 1)$, and the pole 
p = −1/a to the anticausal sequence $[1/(1 - a^2)](-a)^{-l} u(-l)$. Combining the two terms, we 
obtain 

$$r_h(l) = \frac{1}{1 - a^2}\, (-a)^{|l|} \qquad -\infty < l < \infty \tag{4.2.26}$$

or 

$$\rho_h(l) = (-a)^{|l|} \qquad -\infty < l < \infty \tag{4.2.27}$$

Note that complex conjugate poles will contribute two-sided damped sinusoidal terms 
obtained by combining pairs of the form (4.2.27) with a = p and a = p*. 
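
The closed-form autocorrelation of the AP(1) model can be compared with a direct summation over the impulse response. The following is a minimal sketch, assuming the illustrative value a = 0.6 and a truncation of the impulse response at 200 samples, which is long enough for the neglected tail to be far below rounding error; the function name `r_h` is hypothetical.

```python
import numpy as np

a = 0.6                                   # illustrative coefficient (assumption)
n = np.arange(200)
h = (-a) ** n                             # causal impulse response of 1/(1 + a z^-1), truncated

def r_h(l):
    """Deterministic autocorrelation sum_n h(n) h(n - l) for real h and l >= 0."""
    return float(np.dot(h[l:], h[:len(h) - l]))

lags = np.arange(0, 6)
r_numeric = np.array([r_h(l) for l in lags])
r_closed = (-a) ** lags / (1.0 - a ** 2)  # closed form for l >= 0
rho_numeric = r_numeric / r_numeric[0]    # normalized autocorrelation
```

For real h(n) the autocorrelation is even, so the values at negative lags follow from those at positive lags. The alternating sign of `rho_numeric` reflects the pole at z = −a on the negative real axis.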
Impulse train excitations. The response of an AP(P) model to a periodic impulse 
train with period L is periodic with the same period and is given by 

$$\tilde{h}(n) + \sum_{k=1}^{P} a_k\, \tilde{h}(n-k) = d_0 \sum_{m=-\infty}^{\infty} \delta(n + Lm) = \begin{cases} d_0 & n + Lm = 0 \\ 0 & n + Lm \neq 0 \end{cases} \tag{4.2.28}$$

which shows that the prediction error is zero for samples inside the period and $d_0$ at the 
beginning of each period. If we multiply both sides of (4.2.28) by $\tilde{h}^*(n-l)$ and sum over 
a period 0 ≤ n ≤ L − 1, we obtain 

$$\tilde{r}_h(l) + \sum_{k=1}^{P} a_k\, \tilde{r}_h(l-k) = \frac{d_0}{L}\, \tilde{h}^*(-l) \qquad \text{all } l \tag{4.2.29}$$

where $\tilde{r}_h(l)$ is the periodic autocorrelation of $\tilde{h}(n)$. Since, in contrast to h(n) in (4.2.15), 
$\tilde{h}(n)$ is not necessarily zero for n < 0, the periodic autocorrelation $\tilde{r}_h(l)$ will not in general 
obey the linear prediction equation anywhere. Similar results can be obtained for harmonic 
process excitations. 


Model parameters in terms of autocorrelation. Equations (4.2.15) for l = 0, 1, . . . , 
P comprise P + 1 equations that relate the P + 1 parameters of H(z), namely, $d_0$ and 
$\{a_k, 1 \le k \le P\}$, to the first P + 1 autocorrelation coefficients $r_h(0), r_h(1), \ldots, r_h(P)$. 
These P + 1 equations can be written in matrix form as 

$$\begin{bmatrix} r_h(0) & r_h^*(1) & \cdots & r_h^*(P) \\ r_h(1) & r_h(0) & \cdots & r_h^*(P-1) \\ \vdots & \vdots & \ddots & \vdots \\ r_h(P) & r_h(P-1) & \cdots & r_h(0) \end{bmatrix} \begin{bmatrix} 1 \\ a_1 \\ \vdots \\ a_P \end{bmatrix} = \begin{bmatrix} |d_0|^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{4.2.30}$$

If we are given the first P + 1 autocorrelations, (4.2.30) comprises a system of P + 1 linear 
equations, with a Hermitian, Toeplitz matrix that can be solved for $d_0$ and $\{a_k\}$. 

Because of the special structure in (4.2.30), the model parameters are found from the 
autocorrelations by using the last set of P equations in (4.2.30), followed by the computation 
of $d_0$ from the first equation, which is the same as (4.2.17). From (4.2.30), we can write in 
matrix notation 

$$\mathbf{R}_h \mathbf{a} = -\mathbf{r}_h \tag{4.2.31}$$

where $\mathbf{R}_h$ is the autocorrelation matrix, $\mathbf{a}$ is the vector of the model parameters, and $\mathbf{r}_h$ 
is the vector of autocorrelations. Since $r_x(l) = \sigma_w^2\, r_h(l)$, we can also express the model 
parameters in terms of the autocorrelation $r_x(l)$ of the output process x(n) as follows: 

$$\mathbf{R}_x \mathbf{a} = -\mathbf{r}_x \tag{4.2.32}$$

These equations are known as the Yule-Walker equations in the statistics literature. In the 
sequel, we drop the subscript from the autocorrelation sequence or matrix whenever the 
analysis holds for both the impulse response and the model output. 

Because of the Toeplitz structure and the nature of the right-hand side, the linear systems 
(4.2.31) and (4.2.32) can be solved recursively by using the algorithm of Levinson-Durbin 
(see Section 7.4). After $\mathbf{a}$ is solved for, the system gain $d_0$ can be computed from (4.2.17). 

Therefore, given r(0), r(1), . . . , r(P), we can completely specify the parameters of 
the all-pole model by solving a set of linear equations. Below, we will see that the converse 
is also true: Given the model parameters, we can find the first P + 1 autocorrelations by 
solving a set of linear equations. This elegant solution of the spectral factorization problem 
is unique to all-pole models. In the case in which the model contains zeros (Q ≠ 0), the 
spectral factorization problem requires the solution of a nonlinear system of equations. 


Autocorrelation in terms of model parameters. If we normalize the autocorrelations 
in (4.2.31) by dividing throughout by r(0), we obtain the following system of equations 

$$\mathbf{P}\mathbf{a} = -\boldsymbol{\rho} \tag{4.2.33}$$

where $\mathbf{P}$ is the normalized autocorrelation matrix and 

$$\boldsymbol{\rho} = [\rho(1)\ \ \rho(2)\ \ \cdots\ \ \rho(P)]^T \tag{4.2.34}$$

is the vector of normalized autocorrelations. This set of P equations relates the P model 
coefficients with the first P (normalized) autocorrelation values. If the poles of the all-pole 
filter are strictly inside the unit circle, the mapping between the P-dimensional vectors $\mathbf{a}$ and 
$\boldsymbol{\rho}$ is unique. If, in fact, we are given the vector $\mathbf{a}$, then the normalized autocorrelation vector 
$\boldsymbol{\rho}$ can be computed from $\mathbf{a}$ by using the set of equations that can be deduced from (4.2.33) 

$$\mathbf{A}\boldsymbol{\rho} = -\mathbf{a} \tag{4.2.35}$$

where $(\mathbf{A})_{ij} = a_{i-j} + a_{i+j}$, with $a_0 = 1$ and assuming $a_m = 0$ for m < 0 and m > P (see Problem 4.6). 
Given the set of coefficients in $\mathbf{a}$, $\boldsymbol{\rho}$ can be obtained by solving (4.2.35). We will see that, 
under the assumption of a stable H(z), a solution always exists. Furthermore, there exists a 
simple, recursive solution that is efficient (see Section 7.5). If, in addition to $\mathbf{a}$, we are given 
$d_0$, we can evaluate r(0) with (4.2.20) from $\boldsymbol{\rho}$ computed by (4.2.35). Autocorrelation values 
r(l) for lags l > P are found by using the recursion in (4.2.18) with r(0), r(1), . . . , r(P). 
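
The mapping from model coefficients to normalized autocorrelations can be sketched as a small linear solve. This is a minimal illustration, assuming real coefficients, the convention a_0 = 1, and an illustrative stable AP(2) model not taken from the text; the function name `rho_from_a` is hypothetical. The result is cross-checked against the autocorrelation of a long, truncated impulse response.

```python
import numpy as np

def rho_from_a(a):
    """Solve A rho = -a for rho(1..P); a = [1, a_1, ..., a_P], real coefficients."""
    P = len(a) - 1

    def coef(m):
        # a_m with a_0 = 1 and a_m = 0 outside the range 0..P
        return a[m] if 0 <= m <= P else 0.0

    A = np.array([[coef(i - j) + coef(i + j) for j in range(1, P + 1)]
                  for i in range(1, P + 1)])
    return np.linalg.solve(A, -np.asarray(a[1:], dtype=float))

a = [1.0, -1.5, 0.81]                         # illustrative stable AP(2) model (assumption)
rho = rho_from_a(a)

# Cross-check: autocorrelation of the (truncated) impulse response with unit gain
h = np.zeros(2000)
h[0] = 1.0
for n in range(1, len(h)):
    h[n] = -a[1] * h[n - 1] - (a[2] * h[n - 2] if n >= 2 else 0.0)
r = np.array([np.dot(h[l:], h[:len(h) - l]) for l in range(3)])
rho_check = r[1:] / r[0]
```

For P = 2 the matrix reduces to [[1 + a_2, 0], [a_1, 1]], which gives the familiar result ρ(1) = −a_1/(1 + a_2).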


EXAMPLE 4.2.3. For the AP(3) model with real coefficients we have 

$$\begin{bmatrix} r(0) & r(1) & r(2) \\ r(1) & r(0) & r(1) \\ r(2) & r(1) & r(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = -\begin{bmatrix} r(1) \\ r(2) \\ r(3) \end{bmatrix} \tag{4.2.36}$$

$$d_0^2 = r(0) + a_1 r(1) + a_2 r(2) + a_3 r(3) \tag{4.2.37}$$

Therefore, given r(0), r(1), r(2), r(3), we can find the parameters of the all-pole model by 
solving (4.2.36) and then substituting into (4.2.37). 

Suppose now that instead we are given the model parameters $d_0, a_1, a_2, a_3$. If we divide 
both sides of (4.2.36) by r(0) and solve for the normalized autocorrelations ρ(1), ρ(2), and ρ(3), 
we obtain 

$$\begin{bmatrix} 1 + a_2 & a_3 & 0 \\ a_1 + a_3 & 1 & 0 \\ a_2 & a_1 & 1 \end{bmatrix} \begin{bmatrix} \rho(1) \\ \rho(2) \\ \rho(3) \end{bmatrix} = -\begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \tag{4.2.38}$$

The value of r(0) is obtained from 

$$r(0) = \frac{d_0^2}{1 + a_1 \rho(1) + a_2 \rho(2) + a_3 \rho(3)} \tag{4.2.39}$$

If r(0) = 2, r(1) = 1.6, r(2) = 1.2, and r(3) = 1, the Toeplitz matrix in (4.2.36) is positive 
definite because it has positive eigenvalues. Solving the linear system gives $a_1 = -0.9063$, 
$a_2 = 0.2500$, and $a_3 = -0.1563$. Substituting these values in (4.2.37), we obtain $d_0 = 0.8329$. 
Using the last two relations, we can recover the autocorrelation from the model parameters. 

Correlation matching. All-pole models have the unique distinction that the model 
parameters are completely specified by the first P + 1 autocorrelation coefficients via a set 
of linear equations. We can write 

$$\begin{bmatrix} d_0 \\ \mathbf{a} \end{bmatrix} \rightleftharpoons \begin{bmatrix} r(0) \\ \boldsymbol{\rho} \end{bmatrix} \tag{4.2.40}$$

that is, the mapping of the model parameters $\{d_0, a_1, a_2, \ldots, a_P\}$ to the autocorrelation 
coefficients specified by the vector $\{r(0), \rho(1), \ldots, \rho(P)\}$ is reversible and unique. This 
statement implies that given any set of autocorrelation values r(0), r(1), . . . , r(P), we can 
always find an all-pole model whose first P + 1 autocorrelation coefficients are equal to the 
given autocorrelations. This correlation matching of all-pole models is quite remarkable. 
This property is not shared by all-zero models and is true for pole-zero models only under 
certain conditions, as we will see in Section 4.4. 


Spectrum. The z-transform of the autocorrelation $r(l)$ of $H(z)$ is given by

$$R(z) = H(z) H^*\!\left(\frac{1}{z^*}\right) \tag{4.2.41}$$

The spectrum is then equal to

$$R(e^{j\omega}) = |H(e^{j\omega})|^2 = \frac{|d_0|^2}{|A(e^{j\omega})|^2} \tag{4.2.42}$$

The right-hand side of (4.2.42) suggests a method for computing the spectrum: First compute
$A(e^{j\omega})$ by taking the Fourier transform of the sequence $\{1, a_1, \ldots, a_P\}$, then take the
squared magnitude and divide $|d_0|^2$ by the result. The fast Fourier transform (FFT)
can be used to this end by appending to the sequence $\{1, a_1, \ldots, a_P\}$ as many zeros as
needed to compute the desired number of frequency points.
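The procedure suggested by (4.2.42) can be sketched as follows. For the sake of a dependency-free illustration, a plain $O(N^2)$ DFT stands in for the FFT; the function name `allpole_spectrum` and the parameter `nfft` are assumptions made here, and in practice an FFT routine would be applied to the same zero-padded sequence.

```python
import cmath
import math


def allpole_spectrum(a, d0, nfft):
    """Samples of R(e^{jw}) = |d0|^2 / |A(e^{jw})|^2 at w_k = 2*pi*k/nfft.

    a    : [a_1, ..., a_P]
    nfft : number of frequency points (must be at least P + 1)
    """
    # Zero-pad {1, a_1, ..., a_P} to the desired number of frequency points.
    seq = [1.0] + list(a) + [0.0] * (nfft - len(a) - 1)
    spectrum = []
    for k in range(nfft):
        # Direct DFT of the padded sequence (an FFT would give the same values).
        A_k = sum(seq[n] * cmath.exp(-2j * math.pi * k * n / nfft) for n in range(nfft))
        spectrum.append(abs(d0) ** 2 / abs(A_k) ** 2)
    return spectrum


# AP(1) model with a = -0.8 (pole at z = 0.8) and d0 = 1, sampled at 8 frequencies.
spec = allpole_spectrum([-0.8], 1.0, 8)
```

Here `spec[0]` is the value at $\omega = 0$ and `spec[4]` the value at $\omega = \pi$; for a positive pole the former is the larger of the two, in line with the low-pass behavior discussed in Section 4.2.4.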


Partial autocorrelation and lattice structures. We have seen that an AP($P$) model is
completely described by the first $P + 1$ values of its autocorrelation. However, we cannot
determine the order of the model by using the autocorrelation sequence because it has
infinite duration. Suppose that we start fitting models of increasing order $m$, using the
autocorrelation sequence of an AP($P$) model and the Yule-Walker equations

$$\begin{bmatrix} 1 & \rho^*(1) & \cdots & \rho^*(m-1) \\ \rho(1) & 1 & \cdots & \rho^*(m-2) \\ \vdots & \vdots & \ddots & \vdots \\ \rho(m-1) & \rho(m-2) & \cdots & 1 \end{bmatrix} \begin{bmatrix} a_1^{(m)} \\ a_2^{(m)} \\ \vdots \\ a_m^{(m)} \end{bmatrix} = -\begin{bmatrix} \rho(1) \\ \rho(2) \\ \vdots \\ \rho(m) \end{bmatrix} \tag{4.2.43}$$

Since $a_m^{(m)} = 0$ for $m > P$, we can use the sequence $a_m^{(m)}$, $m = 1, 2, \ldots$, which is known as
the partial autocorrelation sequence (PACS), to determine the order of the all-pole model.
Recall from Section 2.5 that

$$a_m^{(m)} = k_m \tag{4.2.44}$$

that is, the PACS is identical to the lattice parameters. A statistical definition and interpretation
of the PACS are also given in Section 7.2. The PACS can be defined for any valid
(i.e., positive definite) autocorrelation sequence and can be efficiently computed by using
the algorithms of Levinson-Durbin and Schur (see Chapter 7).
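The order-selection idea can be illustrated with a short sketch. The recursion below is the standard real-valued Levinson-Durbin algorithm, written here from general knowledge rather than from Chapter 7, so its exact form should be read as an assumption; the function name and return convention are likewise illustrative. It is applied to the AP(3) autocorrelation of Example 4.2.3, extended to higher lags with the recursion (4.2.18).

```python
def levinson_durbin(r, order):
    """Real-valued Levinson-Durbin recursion (illustrative form).

    Returns (a, k, E): coefficients a_1..a_order of the final model,
    the PACS / lattice parameters k_1..k_order, and the final error power.
    """
    a, k, E = [], [], r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        km = -acc / E
        # a_i^(m) = a_i^(m-1) + k_m * a_{m-i}^(m-1), and a_m^(m) = k_m
        a = [a[i] + km * a[m - 2 - i] for i in range(m - 1)] + [km]
        k.append(km)
        E *= 1.0 - km * km
    return a, k, E


r = [2.0, 1.6, 1.2, 1.0]                       # r(0..3) of Example 4.2.3
a3, k3, E3 = levinson_durbin(r, 3)
for l in range(4, 7):                          # extend r(l) for l > P via (4.2.18)
    r.append(-sum(a3[i] * r[l - 1 - i] for i in range(3)))
_, pacs, _ = levinson_durbin(r, 6)             # fit orders m = 1..6
```

In this sketch the first three PACS values are nonzero and have magnitude less than 1, while the values for $m = 4, 5, 6$ vanish to within rounding error, which is how the order $P = 3$ would be identified. The final error power `E3` coincides with $d_0^2$ from (4.2.37).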

Furthermore, it has been shown (Burg 1975) that

$$r(0) \prod_{m=1}^{P} \frac{1 - |k_m|}{1 + |k_m|} \le R(e^{j\omega}) \le r(0) \prod_{m=1}^{P} \frac{1 + |k_m|}{1 - |k_m|} \tag{4.2.45}$$

which indicates that the spectral dynamic range increases if the magnitude of some lattice
parameter moves close to 1 or, equivalently, some pole moves close to the unit circle.


Equivalent model representations. From the previous discussions (see also Chapter 
7) we conclude that a minimum-phase AP( P) model can be uniquely described by any one 
of the following representations: 


1. Direct structure: $\{d_0, a_1, a_2, \ldots, a_P\}$
2. Lattice structure: $\{d_0, k_1, k_2, \ldots, k_P\}$
3. Autocorrelation: $\{r(0), r(1), \ldots, r(P)\}$


where we assume, without loss of generality, that $d_0 > 0$. Note that the minimum-phase
property requires that all poles be inside the unit circle or all $|k_m| < 1$ or that $\mathbf{R}_{P+1}$ be
positive definite. The transformation from any of the above representations to any other can
be done by using the algorithms developed in Section 7.5.


Minimum-phase conditions. As we will show in Section 7.5, if the Toeplitz matrix
$\mathbf{R}_h$ (or equivalently $\mathbf{R}_x$) is positive definite, then $|k_m| < 1$ for all $m = 1, 2, \ldots, P$. Therefore,
the AP($P$) model obtained by solving the Yule-Walker equations is minimum-phase.
Thus, the Yule-Walker equations provide a simple and elegant solution to the spectral
factorization problem for all-pole models.


EXAMPLE 4.2.4. The poles of the model obtained in Example 4.2.3 are 0.8316, 0.0373 + 0.4319i,
and 0.0373 - 0.4319i. We see that the poles are inside the unit circle and that the autocorrelation
sequence is positive definite. If we set $r_h(2) = -1.2$, the autocorrelation is no longer positive
definite and the obtained model $\mathbf{a} = [1 \;\; {-1.222} \;\; 1.1575]^T$, $d_0 = 2.2271$, is nonminimum-phase.


Pole locations. The poles of $H(z)$ are the zeros $\{p_k\}$ of the polynomial $A(z)$. If the
coefficients of $A(z)$ are assumed to be real, the poles are either real or come in complex
conjugate pairs. In order for $H(z)$ to be minimum-phase, all poles must be inside the unit
circle, that is, $|p_k| < 1$. The model parameters $a_k$ can be written as sums of products of the
poles $p_k$. In particular, it is easy to see that

$$a_1 = -\sum_{k=1}^{P} p_k \tag{4.2.46}$$

$$a_P = \prod_{k=1}^{P} (-p_k) \tag{4.2.47}$$

Thus, the first coefficient $a_1$ is the negative of the sum of the poles, and the last coefficient
$a_P$ is the product of the negatives of the individual poles. Since $|p_k| < 1$, we must have
$|a_P| < 1$ for a minimum-phase polynomial for which $a_0 = 1$. However, note that the
reverse is not necessarily true: $|a_P| < 1$ does not guarantee minimum phase. The roots $p_k$
can be computed by using any number of standard root-finding routines.
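Relations (4.2.46) and (4.2.47), and the caveat about $|a_P| < 1$, can be checked by expanding $A(z) = \prod_k (1 - p_k z^{-1})$ directly. The sketch below is illustrative only: `poly_from_poles` is a name introduced here, the first pole set is the rounded set quoted in Example 4.2.4, and the second pole set is a made-up counterexample.

```python
def poly_from_poles(poles):
    """Coefficients [1, a_1, ..., a_P] of A(z) = prod_k (1 - p_k z^{-1})."""
    coeffs = [1.0 + 0j]
    for p in poles:
        # Multiplying by (1 - p z^{-1}): new[n] = old[n] - p * old[n-1]
        coeffs = [c - p * prev for c, prev in zip(coeffs + [0j], [0j] + coeffs)]
    return coeffs


# Poles quoted in Example 4.2.4 (rounded to four decimals in the text).
poles = [0.8316, 0.0373 + 0.4319j, 0.0373 - 0.4319j]
A = poly_from_poles(poles)
sum_rule = -sum(poles)                 # right-hand side of (4.2.46)
prod_rule = 1.0 + 0j
for p in poles:
    prod_rule *= -p                    # right-hand side of (4.2.47)

# Hypothetical counterexample: |a_P| < 1 although one pole lies outside the unit circle.
B = poly_from_poles([2.0, 0.25])       # a_2 = 0.5, but the pole at z = 2 is unstable
```

Because the quoted poles are rounded, the recovered coefficients agree with those of Example 4.2.3 only to about three decimal places.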


4.2.2 All-Pole Modeling and Linear Prediction 


Consider the AP($P$) model

$$x(n) = -\sum_{k=1}^{P} a_k x(n-k) + w(n) \tag{4.2.48}$$

Now recall from Chapter 1 that the $M$th-order linear predictor of $x(n)$ and the corresponding
prediction error $e(n)$ are

$$\hat{x}(n) = -\sum_{k=1}^{M} a_k^{o} x(n-k) \tag{4.2.49}$$

$$e(n) = x(n) - \hat{x}(n) = x(n) + \sum_{k=1}^{M} a_k^{o} x(n-k) \tag{4.2.50}$$

or

$$x(n) = -\sum_{k=1}^{M} a_k^{o} x(n-k) + e(n) \tag{4.2.51}$$

Notice that if the order of the linear predictor equals the order of the all-pole model ($M = P$)
and if $a_k^{o} = a_k$, then the prediction error is equal to the excitation of the all-pole model,
that is, $e(n) = w(n)$. Since all-pole modeling and FIR linear prediction are closely related,
many properties and algorithms developed for one of them can be applied to the other.
Linear prediction is extensively studied in Chapters 6 and 7.
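The statement $e(n) = w(n)$ can be verified with a small simulation. This is a sketch under stated assumptions: the function names are introduced here, zero initial conditions are assumed, and the coefficients are those of Example 4.2.3, used simply because that model is known to be stable.

```python
import random


def allpole_filter(a, w):
    """x(n) = -sum_k a_k x(n-k) + w(n), as in (4.2.48), with zero initial conditions."""
    x = []
    for n, wn in enumerate(w):
        past = sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        x.append(wn - past)
    return x


def prediction_error(a_o, x):
    """e(n) = x(n) + sum_k a_k^o x(n-k), as in (4.2.50), with zero initial conditions."""
    return [
        x[n] + sum(a_o[k] * x[n - 1 - k] for k in range(len(a_o)) if n - 1 - k >= 0)
        for n in range(len(x))
    ]


random.seed(0)
a = [-0.90625, 0.25, -0.15625]                  # AP(3) model of Example 4.2.3
w = [random.gauss(0.0, 1.0) for _ in range(200)]
x = allpole_filter(a, w)
e = prediction_error(a, x)                      # predictor with M = P and a_k^o = a_k
max_diff = max(abs(en - wn) for en, wn in zip(e, w))
```

With matching order and coefficients the FIR prediction-error filter is the exact inverse of the all-pole filter, so `max_diff` is at the level of rounding error.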


4.2.3 Autoregressive Models 


Causal all-pole models excited by white noise play a major role in practical applications
and are known as autoregressive (AR) models. An AR($P$) model is defined by the difference
equation

$$x(n) = -\sum_{k=1}^{P} a_k x(n-k) + w(n) \tag{4.2.52}$$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. An AR($P$) model is valid only if the corresponding AP($P$)
system is stable. In this case, the output $x(n)$ is a stationary sequence with a mean value
of zero. Postmultiplying (4.2.52) by $x^*(n-l)$ and taking the expectation, we obtain the
following recursive relation for the autocorrelation:

$$r_x(l) = -\sum_{k=1}^{P} a_k r_x(l-k) + E\{w(n) x^*(n-l)\} \tag{4.2.53}$$

Similarly, using (4.1.1), we can show that $E\{w(n) x^*(n-l)\} = \sigma_w^2 h^*(-l)$. Thus, we have

$$r_x(l) = -\sum_{k=1}^{P} a_k r_x(l-k) + \sigma_w^2 h^*(-l) \qquad \text{for all } l \tag{4.2.54}$$

The variance of the output signal is

$$\sigma_x^2 = r_x(0) = -\sum_{k=1}^{P} a_k r_x^*(k) + \sigma_w^2$$

or

$$\sigma_x^2 = \frac{\sigma_w^2}{1 + \displaystyle\sum_{k=1}^{P} a_k \rho_x^*(k)} \tag{4.2.55}$$

If we substitute $l = 0, 1, \ldots, P$ in (4.2.54), take complex conjugates, and recall that $h(n) = 0$
for $n < 0$, we obtain the following set of Yule-Walker equations:

$$\begin{bmatrix} r_x(0) & r_x(1) & \cdots & r_x(P) \\ r_x^*(1) & r_x(0) & \cdots & r_x(P-1) \\ \vdots & \vdots & \ddots & \vdots \\ r_x^*(P) & r_x^*(P-1) & \cdots & r_x(0) \end{bmatrix} \begin{bmatrix} 1 \\ a_1^* \\ \vdots \\ a_P^* \end{bmatrix} = \begin{bmatrix} \sigma_w^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{4.2.56}$$


Careful inspection of the above equations reveals their similarity to the corresponding
relationships developed previously for the AP($P$) model. This should be no surprise
since the power spectrum of the white noise is flat. However, there is one important
difference we should clarify: AP($P$) models were specified with a gain $d_0$ and the parameters
$\{a_1, a_2, \ldots, a_P\}$, but for AR($P$) models we set the gain $d_0 = 1$ and define the model by the
variance of the white excitation $\sigma_w^2$ and the parameters $\{a_1, a_2, \ldots, a_P\}$. In other words,
we incorporate the gain of the model into the power of the input signal. Thus, the power
spectrum of the output is $R_x(e^{j\omega}) = \sigma_w^2 |H(e^{j\omega})|^2$. Similar arguments apply to all
parametric models driven by white noise. We just rederived some of the relationships to clarify
these issues and to provide additional insight into the subject.


4.2.4 Lower-Order Models 


In this section, we derive the properties of lower-order all-pole models, namely, first- and 
second-order models, with real coefficients. 


First-order all-pole model: AP(1) 
An AP(1) model has a transfer function

$$H(z) = \frac{d_0}{1 + a z^{-1}} \tag{4.2.57}$$

with a single pole at $z = -a$ on the real axis. It is clear that $H(z)$ is minimum-phase if

$$-1 < a < 1 \tag{4.2.58}$$

From (4.2.18) with $P = 1$ and $l = 1$, we have

$$a = -\frac{r(1)}{r(0)} = -\rho(1) \tag{4.2.59}$$

Similarly, from (4.2.44) with $m = 1$,

$$a_1^{(1)} = a = -\rho(1) = k_1 \tag{4.2.60}$$

Since from (4.2.4), $h(0) = d_0$, and from (4.2.5) $h(n) = -a\,h(n-1)$ for $n > 0$, the impulse
response of a single-pole filter is given by

$$h(n) = d_0 (-a)^n u(n) \tag{4.2.61}$$

The same result can, of course, be obtained by taking the inverse z-transform of $H(z)$.
The autocorrelation is found in a similar fashion. From (4.2.18) and by using the fact
that the autocorrelation is an even function,

$$r(l) = r(0)(-a)^{|l|} \qquad \text{for all } l \tag{4.2.62}$$

and from (4.2.20)

$$r(0) = \frac{d_0^2}{1 + a\rho(1)} = \frac{d_0^2}{1 - a^2} \tag{4.2.63}$$

Therefore, if the energy $r(0)$ in the impulse response is set to unity, then the gain must be
set to

$$d_0 = \sqrt{1 - a^2} \qquad r(0) = 1 \tag{4.2.64}$$

The z-transform of the autocorrelation is then

$$R(z) = \frac{d_0^2}{(1 + a z^{-1})(1 + a z)} = r(0) \sum_{l=-\infty}^{\infty} (-a)^{|l|} z^{-l} \tag{4.2.65}$$

and the spectrum is

$$R(e^{j\omega}) = |H(e^{j\omega})|^2 = \frac{d_0^2}{|1 + a e^{-j\omega}|^2} = \frac{d_0^2}{1 + 2a\cos\omega + a^2} \tag{4.2.66}$$

Figures 4.4 and 4.5 show a typical realization of the output, the impulse response, 
autocorrelation, and spectrum of two AP(1) models. The sample process realizations were 
obtained by driving the model with white Gaussian noise of zero mean and unit variance. 
When the positive pole ($p = -a = 0.8$) is close to the unit circle, successive samples
of the output process are similar, as dictated by the slowly decaying autocorrelation and
the corresponding low-pass spectrum. In contrast, a negative pole close to the unit circle 
results in a rapidly oscillating sequence. This is clearly reflected in the alternating sign of 
the autocorrelation sequence and the associated high-pass spectrum. 


Note that a positive real pole is a type of low-pass filter, while a negative real pole 
has the spectral characteristics of a high-pass filter. (This situation in the digital domain 
contrasts with that in the corresponding analog domain where a real-axis pole can only 
have low-pass characteristics.) The discrete-time negative real pole can be thought of as
one-half of two conjugate poles at half the sampling frequency. Notice that both spectra are
even and have zero slope at $\omega = 0$ and $\omega = \pi$. These properties hold for the spectra
of all parametric models (i.e., pole-zero models) with real coefficients (see Problem 4.13).

Consider now the real-valued AR(1) process $x(n)$ generated by

$$x(n) = -a x(n-1) + w(n) \tag{4.2.67}$$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. Using the formula $R_x(z) = \sigma_w^2 H(z) H^*(1/z^*)$ and previous
results, we can see that the autocorrelation and the PSD of $x(n)$ are given by

$$r_x(l) = \frac{\sigma_w^2}{1 - a^2} (-a)^{|l|} \qquad \text{and} \qquad R_x(e^{j\omega}) = \frac{\sigma_w^2}{1 + a^2 + 2a\cos\omega}$$

respectively. Since $\sigma_x^2 = r_x(0) = \sigma_w^2/(1 - a^2)$, the SFM of $x(n)$ is [see (4.1.18)]

$$\mathrm{SFM}_x = \frac{\sigma_w^2}{\sigma_x^2} = 1 - a^2 \tag{4.2.68}$$

Clearly, if $a = 0$, then from (4.2.67), $x(n)$ is a white noise process and from (4.2.68),
$\mathrm{SFM}_x = 1$. If $a \to 1$, then $\mathrm{SFM}_x \to 0$; and in the limit when $a = 1$, the process becomes
a random walk process, which is a nonstationary process with linearly increasing variance
$E\{x^2(n)\} = n\sigma_w^2$. The correlation matrix is Toeplitz, and it is a rare exception in which
eigenvalues and eigenvectors can be described by analytical expressions (Jayant and Noll
1984).
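Equation (4.2.68) can be cross-checked numerically. The sketch below assumes that the SFM of (4.1.18) is the ratio of the geometric mean to the arithmetic mean of the PSD over one period, approximated here on a uniform frequency grid; the function names and the grid size are choices made for this illustration.

```python
import math


def ar1_psd(a, var_w, w):
    """PSD of the real AR(1) process x(n) = -a x(n-1) + w(n)."""
    return var_w / (1.0 + a * a + 2.0 * a * math.cos(w))


def spectral_flatness(psd_samples):
    """Geometric mean over arithmetic mean of uniformly spaced PSD samples."""
    n = len(psd_samples)
    geometric = math.exp(sum(math.log(v) for v in psd_samples) / n)
    arithmetic = sum(psd_samples) / n
    return geometric / arithmetic


a, var_w, N = 0.8, 2.0, 4096
samples = [ar1_psd(a, var_w, 2.0 * math.pi * i / N) for i in range(N)]
sfm = spectral_flatness(samples)      # compare with 1 - a^2
var_x = sum(samples) / N              # arithmetic mean approximates r_x(0)
```

For these values the numerical SFM agrees with $1 - a^2 = 0.36$ and the arithmetic mean agrees with $\sigma_w^2/(1 - a^2)$, independently of the chosen $\sigma_w^2$.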


Second-order all-pole model: AP(2) 
The system function of an AP(2) model is given by

$$H(z) = \frac{d_0}{1 + a_1 z^{-1} + a_2 z^{-2}} = \frac{d_0}{(1 - p_1 z^{-1})(1 - p_2 z^{-1})} \tag{4.2.69}$$

From (4.2.46) and (4.2.47), we have

$$a_1 = -(p_1 + p_2) \qquad a_2 = p_1 p_2 \tag{4.2.70}$$

Recall that $H(z)$ is minimum-phase if the two poles $p_1$ and $p_2$ are inside the unit circle.
Under these conditions, $a_1$ and $a_2$ lie in a triangular region defined by

$$-1 < a_2 < 1 \qquad a_2 - a_1 > -1 \qquad a_2 + a_1 > -1 \tag{4.2.71}$$

and shown in Figure 4.6. The first condition follows from (4.2.70) since $|p_1| < 1$ and
$|p_2| < 1$. The last two conditions can be derived by assuming real roots and setting the
larger root to less than 1 and the smaller root to greater than $-1$. By adding the last two
conditions, we obtain the redundant condition $a_2 > -1$.
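The equivalence between the triangle (4.2.71) and the pole condition $|p_1|, |p_2| < 1$ can be spot-checked by comparing the two tests over a grid of coefficient pairs. This is only a sketch: the function names and the grid (offset so that no point falls exactly on a boundary) are assumptions made for the illustration.

```python
import cmath


def in_min_phase_triangle(a1, a2):
    """Conditions (4.2.71) on the AP(2) coefficients."""
    return (-1.0 < a2 < 1.0) and (a2 - a1 > -1.0) and (a2 + a1 > -1.0)


def ap2_poles(a1, a2):
    """Roots of z^2 + a1*z + a2, i.e., the poles of H(z) in (4.2.69)."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return (-a1 + disc) / 2.0, (-a1 - disc) / 2.0


# Compare the triangle test with the direct pole test on an offset grid.
agree = True
n_inside = 0
for i in range(17):
    for j in range(11):
        a1, a2 = -2.45 + 0.3 * i, -1.45 + 0.3 * j
        inside_circle = max(abs(p) for p in ap2_poles(a1, a2)) < 1.0
        n_inside += inside_circle
        agree = agree and (inside_circle == in_min_phase_triangle(a1, a2))
```

On this grid the two tests agree at every point, with some points falling inside the triangle and the rest outside.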
Complex roots occur in the region

$$\frac{a_1^2}{4} < a_2 \le 1 \qquad \text{complex poles} \tag{4.2.72}$$

with $a_2 = 1$ resulting in both roots being on the unit circle. Note that, in order to have
complex poles, $a_2$ cannot be negative. If the complex poles are written in polar form

$$p_i = r e^{\pm j\theta} \qquad 0 \le r \le 1 \tag{4.2.73}$$

then

$$a_1 = -2r\cos\theta \qquad a_2 = r^2 \tag{4.2.74}$$

and

$$H(z) = \frac{d_0}{1 - (2r\cos\theta) z^{-1} + r^2 z^{-2}} \qquad \text{complex poles} \tag{4.2.75}$$

Here, $r$ is the radius (magnitude) of the poles, and $\theta$ is the angle or normalized frequency
of the poles.

FIGURE 4.6
Minimum-phase region (triangle) for the AP(2) model in the $(a_1, a_2)$
parameter space.


Impulse response. The impulse response of an AP(2) model can be written in terms
of its two poles by evaluating the inverse z-transform of (4.2.69). The result is

$$h(n) = \frac{d_0}{p_1 - p_2} \left( p_1^{n+1} - p_2^{n+1} \right) u(n) \tag{4.2.76}$$

for $p_1 \ne p_2$. Otherwise, for $p_1 = p_2 = p$,

$$h(n) = d_0 (n+1) p^n u(n) \tag{4.2.77}$$

In the special case of a complex conjugate pair of poles $p_1 = r e^{j\theta}$ and $p_2 = r e^{-j\theta}$,
Equation (4.2.76) reduces to

$$h(n) = d_0 r^n \frac{\sin[(n+1)\theta]}{\sin\theta} u(n) \qquad \text{complex poles} \tag{4.2.78}$$

Since $0 < r < 1$, $h(n)$ is a damped sinusoid of frequency $\theta$.


Autocorrelation. The autocorrelation can also be written in terms of the two poles as

$$r(l) = \frac{d_0^2}{(p_1 - p_2)(1 - p_1 p_2)} \left( \frac{p_1^{l+1}}{1 - p_1^2} - \frac{p_2^{l+1}}{1 - p_2^2} \right) \qquad l \ge 0 \tag{4.2.79}$$

from which we can deduce the energy

$$r(0) = \frac{d_0^2 (1 + p_1 p_2)}{(1 - p_1 p_2)(1 - p_1^2)(1 - p_2^2)} \tag{4.2.80}$$

For the special case of a complex conjugate pole pair, (4.2.79) can be rewritten as

$$r(l) = \frac{d_0^2 r^l \{\sin[(l+1)\theta] - r^2 \sin[(l-1)\theta]\}}{[(1 - r^2)\sin\theta](1 - 2r^2\cos 2\theta + r^4)} \qquad l \ge 0 \tag{4.2.81}$$

Then from (4.2.80) we can write an expression for the energy in terms of the polar
coordinates of the complex conjugate pole pair

$$r(0) = \frac{d_0^2 (1 + r^2)}{(1 - r^2)(1 - 2r^2\cos 2\theta + r^4)} \tag{4.2.82}$$

The normalized autocorrelation is given by

$$\rho(l) = \frac{r^l \{\sin[(l+1)\theta] - r^2 \sin[(l-1)\theta]\}}{(1 + r^2)\sin\theta} \qquad l \ge 0 \tag{4.2.83}$$

which can be rewritten as

$$\rho(l) = \frac{1}{\cos\beta}\, r^l \cos(l\theta - \beta) \qquad l \ge 0 \tag{4.2.84}$$

where

$$\tan\beta = \frac{(1 - r^2)\cos\theta}{(1 + r^2)\sin\theta} \tag{4.2.85}$$

Therefore, $\rho(l)$ is a damped cosine wave with its maximum amplitude at the origin.


Spectrum. By setting the two poles equal to

$$p_1 = r_1 e^{j\theta_1} \qquad p_2 = r_2 e^{j\theta_2} \tag{4.2.86}$$

the spectrum of an AP(2) model can be written as

$$R(e^{j\omega}) = \frac{d_0^2}{[1 - 2r_1\cos(\omega - \theta_1) + r_1^2][1 - 2r_2\cos(\omega - \theta_2) + r_2^2]} \tag{4.2.87}$$

There are four cases of interest

Pole locations                        Peak locations                  Type of $R(e^{j\omega})$

$p_1 > 0$, $p_2 > 0$                  $\omega = 0$                    Low-pass
$p_1 < 0$, $p_2 < 0$                  $\omega = \pi$                  High-pass
$p_1 > 0$, $p_2 < 0$                  $\omega = 0$, $\omega = \pi$    Stopband
$p_{1,2} = r e^{\pm j\theta}$         $0 < \omega < \pi$              Bandpass

and they depend on the location of the poles on the complex plane.
We concentrate on the fourth case of complex conjugate poles, which is of greatest
interest. The other three cases are explored in Problem 4.15. The spectrum is given by

$$R(e^{j\omega}) = \frac{d_0^2}{[1 - 2r\cos(\omega - \theta) + r^2][1 - 2r\cos(\omega + \theta) + r^2]} \tag{4.2.88}$$

The peak of this spectrum can be shown to be located at a frequency $\omega_c$, given by

$$\cos\omega_c = \frac{1 + r^2}{2r}\cos\theta \tag{4.2.89}$$

Since $1 + r^2 > 2r$ for $r < 1$, we have

$$|\cos\omega_c| > |\cos\theta| \tag{4.2.90}$$

so the spectral peak is lower than the pole frequency for $0 < \theta < \pi/2$ and higher than the
pole frequency for $\pi/2 < \theta < \pi$.
This behavior is illustrated in Figure 4.7 for an AP(2) model with $a_1 = -0.4944$,
$a_2 = 0.64$, and $d_0 = 1$. The model has two complex conjugate poles with $r = 0.8$ and
$\theta = \pm 2\pi/5$. The spectrum has a single peak and displays a passband type of behavior. The
impulse response is a damped sine wave while the autocorrelation is a damped cosine. The
typical realization of the output shows clearly a pseudoperiodic behavior that is explained
by the shape of the autocorrelation and the spectrum of the model. We also notice that if
the poles are complex conjugates, the autocorrelation has pseudoperiodic behavior.
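The peak location predicted by (4.2.89) can be compared with a brute-force search over a frequency grid for the model of Figure 4.7 ($r = 0.8$, $\theta = 2\pi/5$, $d_0 = 1$). This is a minimal sketch; the function name and the grid density are choices made for the illustration.

```python
import math


def ap2_spectrum(r, theta, w, d0=1.0):
    """Spectrum (4.2.88) of an AP(2) model with poles r*exp(+/- j*theta)."""
    def factor(t):
        return 1.0 - 2.0 * r * math.cos(w - t) + r * r
    return d0 ** 2 / (factor(theta) * factor(-theta))


r, theta = 0.8, 2.0 * math.pi / 5.0
a1, a2 = -2.0 * r * math.cos(theta), r * r            # (4.2.74)

# Peak frequency predicted by (4.2.89).
w_c = math.acos((1.0 + r * r) / (2.0 * r) * math.cos(theta))

# Brute-force peak search on a uniform grid over [0, pi].
n_grid = 20000
grid = [math.pi * i / n_grid for i in range(n_grid + 1)]
w_peak = max(grid, key=lambda w: ap2_spectrum(r, theta, w))
```

Since $\theta < \pi/2$ here, the peak lies slightly below the pole angle, and the grid search agrees with (4.2.89) to within one grid step.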


Equivalent model descriptions. We now write explicit formulas for $a_1$ and $a_2$ in terms
of the lattice parameters $k_1$ and $k_2$ and the autocorrelation coefficients. From the step-up
and step-down recursions in Section 2.5, we have

$$a_1 = k_1 (1 + k_2) \qquad a_2 = k_2 \tag{4.2.91}$$

and the inverse relations

$$k_1 = \frac{a_1}{1 + a_2} \qquad k_2 = a_2 \tag{4.2.92}$$

From the Yule-Walker equations (4.2.18), we can write the two equations

$$a_1 r(0) + a_2 r(1) = -r(1) \qquad a_1 r(1) + a_2 r(0) = -r(2) \tag{4.2.93}$$

which can be solved for $a_1$ and $a_2$ in terms of $\rho(1)$ and $\rho(2)$

$$a_1 = -\rho(1) \frac{1 - \rho(2)}{1 - \rho^2(1)} \qquad a_2 = \frac{\rho^2(1) - \rho(2)}{1 - \rho^2(1)} \tag{4.2.94}$$

or for $\rho(1)$ and $\rho(2)$ in terms of $a_1$ and $a_2$

$$\rho(1) = -\frac{a_1}{1 + a_2} \qquad \rho(2) = -a_1 \rho(1) - a_2 = \frac{a_1^2}{1 + a_2} - a_2 \tag{4.2.95}$$

From the equations above, we can also write the relation and inverse relation between the
coefficients $k_1$ and $k_2$ and the normalized autocorrelations $\rho(1)$ and $\rho(2)$ as

$$k_1 = -\rho(1) \qquad k_2 = \frac{\rho^2(1) - \rho(2)}{1 - \rho^2(1)} \tag{4.2.96}$$

and

$$\rho(1) = -k_1 \qquad \rho(2) = k_1^2 (1 + k_2) - k_2 \tag{4.2.97}$$

The gain $d_0$ can also be written in terms of the other coefficients. From (4.2.20), we have

$$d_0^2 = r(0)[1 + a_1\rho(1) + a_2\rho(2)] \tag{4.2.98}$$

which can be shown to be equal to

$$d_0^2 = r(0)(1 - k_1^2)(1 - k_2^2) \tag{4.2.99}$$

FIGURE 4.7
[Caption partially lost in extraction] AP(2) model with complex conjugate poles.
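The conversions among the three AP(2) descriptions are simple enough to collect in a few lines. The sketch below covers the real-coefficient case; the function names are introduced here for illustration, and the test values are those of the Figure 4.7 model.

```python
def a_from_k(k1, k2):
    """(4.2.91): lattice parameters -> direct-form coefficients."""
    return k1 * (1.0 + k2), k2


def k_from_a(a1, a2):
    """(4.2.92): direct-form coefficients -> lattice parameters."""
    return a1 / (1.0 + a2), a2


def a_from_rho(rho1, rho2):
    """(4.2.94): normalized autocorrelations -> direct-form coefficients."""
    den = 1.0 - rho1 ** 2
    return -rho1 * (1.0 - rho2) / den, (rho1 ** 2 - rho2) / den


def rho_from_a(a1, a2):
    """(4.2.95): direct-form coefficients -> normalized autocorrelations."""
    rho1 = -a1 / (1.0 + a2)
    return rho1, -a1 * rho1 - a2


a1, a2 = -0.4944, 0.64                       # AP(2) model of Figure 4.7
k1, k2 = k_from_a(a1, a2)
rho1, rho2 = rho_from_a(a1, a2)
gain_from_rho = 1.0 + a1 * rho1 + a2 * rho2  # d0^2 / r(0) from (4.2.98)
gain_from_k = (1.0 - k1 ** 2) * (1.0 - k2 ** 2)   # d0^2 / r(0) from (4.2.99)
```

Round trips through any pair of representations return the starting values, and the two expressions for $d_0^2 / r(0)$ coincide.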


Minimum-phase conditions. In (4.2.71), we have a set of conditions on $a_1$ and $a_2$ so
that the AP(2) model is minimum-phase, and Figure 4.6 shows the corresponding admissible
region for minimum-phase models. Similar relations and regions can be derived for the
other types of parameters, as we will show below. In terms of $k_1$ and $k_2$, the AP(2) model
is minimum-phase if

$$|k_1| < 1 \qquad |k_2| < 1 \tag{4.2.100}$$

This region is depicted in Figure 4.8(a). Shown also is the region that results in complex
roots, which is specified by

$$0 < k_2 < 1 \tag{4.2.101}$$

$$k_1^2 < \frac{4k_2}{(1 + k_2)^2} \tag{4.2.102}$$

Because of the correlation matching property of all-pole models, we can find a minimum-phase
all-pole model for every positive definite sequence of autocorrelation values. Therefore,
the admissible region of autocorrelation values coincides with the positive definite
region. The positive definite condition is equivalent to having all the leading principal minors of
the autocorrelation matrix in (4.2.30) be positive; that is, the corresponding determinants
are positive. For $P = 2$, there are two conditions:

$$\det \begin{bmatrix} 1 & \rho(1) \\ \rho(1) & 1 \end{bmatrix} > 0 \qquad \det \begin{bmatrix} 1 & \rho(1) & \rho(2) \\ \rho(1) & 1 & \rho(1) \\ \rho(2) & \rho(1) & 1 \end{bmatrix} > 0 \tag{4.2.103}$$

These two conditions reduce to

$$|\rho(1)| < 1 \tag{4.2.104}$$

$$2\rho^2(1) - 1 < \rho(2) < 1 \tag{4.2.105}$$

which determine the admissible region shown in Figure 4.8(b). Conditions (4.2.104) and
(4.2.105) can also be derived from (4.2.71) and (4.2.95). The first of these conditions is
equivalent to

$$\left| \frac{a_1}{1 + a_2} \right| < 1 \tag{4.2.106}$$

which can be shown to be equivalent to the last two conditions in (4.2.71).
It is important to note that the region in Figure 4.8(b) is the admissible region for any
positive definite autocorrelation, including the autocorrelation of mixed-phase signals. This
is reasonable since the autocorrelation does not contain phase information and allows the
signal to have minimum- and maximum-phase components. What we are claiming here,
however, is that for every autocorrelation sequence in the positive definite region, we can
find a minimum-phase all-pole model with the same autocorrelation values. Therefore, for
this problem, the positive definite region is identical to the admissible minimum-phase
region.

FIGURE 4.8
Minimum-phase and positive definiteness regions for the AP(2) model in the (a) $(k_1, k_2)$
space and (b) $(\rho(1), \rho(2))$ space.


4.3 ALL-ZERO MODELS 


In this section, we investigate the properties of the all-zero model. The output of the all-zero
model is the weighted average of delayed versions of the input signal

$$x(n) = \sum_{k=0}^{Q} d_k w(n-k) \tag{4.3.1}$$

where $Q$ is the order of the model. The system function is

$$H(z) = D(z) = \sum_{k=0}^{Q} d_k z^{-k} \tag{4.3.2}$$

The all-zero model can be implemented by using either a direct or a lattice structure. The
conversion between the two sets of parameters can be done by using the step-up and step-down
recursions described in Chapter 7 and setting $A(z) = D(z)$. Notice that the same set
of parameters can be used to implement either an all-zero or an all-pole model by using a
different structure.


4.3.1 Model Properties 
We next provide a brief discussion of the properties of the all-zero model. 


Impulse response. It can be easily seen that the AZ($Q$) model is an FIR system with
an impulse response

$$h(n) = \begin{cases} d_n & 0 \le n \le Q \\ 0 & \text{elsewhere} \end{cases} \tag{4.3.3}$$

Autocorrelation. The autocorrelation of the impulse response is given by

$$r_h(l) = \sum_{n=-\infty}^{\infty} h(n) h^*(n-l) = \begin{cases} \displaystyle\sum_{k=0}^{Q-l} d_{k+l} d_k^* & 0 \le l \le Q \\ 0 & l > Q \end{cases} \tag{4.3.4}$$

and

$$r_h(-l) = r_h^*(l) \qquad \text{all } l \tag{4.3.5}$$

We usually set $d_0 = 1$, which implies that

$$r_h(l) = d_l + d_{l+1} d_1^* + \cdots + d_Q d_{Q-l}^* \qquad l = 0, 1, \ldots, Q \tag{4.3.6}$$

hence, the normalized autocorrelation is

$$\rho_h(l) = \begin{cases} \dfrac{d_l + d_{l+1} d_1^* + \cdots + d_Q d_{Q-l}^*}{1 + |d_1|^2 + \cdots + |d_Q|^2} & l = 1, 2, \ldots, Q \\ 0 & l > Q \end{cases} \tag{4.3.7}$$

We see that the autocorrelation of an AZ($Q$) model is zero for lags $|l|$ exceeding the order
$Q$ of the model. If $\rho_h(1), \rho_h(2), \ldots, \rho_h(Q)$ are known, then the $Q$ equations (4.3.7) can
be solved for model parameters $d_1, d_2, \ldots, d_Q$. However, unlike the Yule-Walker equations
for the AP($P$) model, which are linear, Equations (4.3.7) are nonlinear and their solution is
quite complicated (see Section 9.3).


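Spectrum. The spectrum of the AZ($Q$) model is given by

$$R_h(e^{j\omega}) = D(z) D^*\!\left(\frac{1}{z^*}\right)\bigg|_{z = e^{j\omega}} = |D(e^{j\omega})|^2 = \sum_{l=-Q}^{Q} r_h(l) e^{-j\omega l} \tag{4.3.8}$$

which is basically a trigonometric polynomial.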


Impulse train excitations. The response $\tilde{h}(n)$ of the AZ($Q$) model to a periodic
impulse train with period $L$ is periodic with the same period, and its spectrum is a sampled
version of (4.3.8) at multiples of $2\pi/L$ (see Section 2.3.2). Therefore, to recover the
autocorrelation $r_h(l)$ and the spectrum $R_h(e^{j\omega})$ from the autocorrelation or spectrum of $\tilde{h}(n)$,
we should have $L \ge 2Q + 1$ in order to avoid aliasing in the autocorrelation lag domain.
Also, if $L > Q$, the impulse response $h(n)$, $0 \le n \le Q$, can be recovered from the response
$\tilde{h}(n)$ (no time-domain aliasing) (see Problem 4.24).


Partial autocorrelation and lattice-ladder structures. The PACS of an AZ($Q$) model
is computed by fitting a series of AP($P$) models for $P = 1, 2, \ldots$, to the autocorrelation
sequence (4.3.7) of the AZ($Q$) model. Since the AZ($Q$) model is equivalent to an AP($\infty$)
model, the PACS of an all-zero model has infinite extent and behaves as the autocorrelation
sequence of an all-pole model. This is illustrated later for the low-order AZ(1) and AZ(2)
models.


4.3.2 Moving-Average Models 


A moving-average model is an AZ($Q$) model with $d_0 = 1$ driven by white noise, that is,

$$x(n) = w(n) + \sum_{k=1}^{Q} d_k w(n-k) \tag{4.3.9}$$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. The output $x(n)$ has zero mean and variance of

$$\sigma_x^2 = \sigma_w^2 \sum_{k=0}^{Q} |d_k|^2 \tag{4.3.10}$$

The autocorrelation and power spectrum are given by $r_x(l) = \sigma_w^2 r_h(l)$ and $R_x(e^{j\omega}) =
\sigma_w^2 |D(e^{j\omega})|^2$, respectively. Clearly, observations that are more than $Q$ samples apart are
uncorrelated because the autocorrelation is zero after lag $Q$.


4.3.3 Lower-Order Models 


To familiarize ourselves with all-zero models, we next investigate in detail the properties 
of the AZ(1) and AZ(2) models with real coefficients. 


The first-order all-zero model: AZ(1). For generality, we consider an AZ(1) model
whose system function is

$$H(z) = G(1 + d_1 z^{-1}) \tag{4.3.11}$$

The model is stable for any value of $d_1$ and minimum-phase for $-1 < d_1 < 1$. The
autocorrelation is the inverse z-transform of

$$R_h(z) = H(z) H(z^{-1}) = G^2 [d_1 z + (1 + d_1^2) + d_1 z^{-1}] \tag{4.3.12}$$

Hence, $r_h(0) = G^2 (1 + d_1^2)$, $r_h(1) = r_h(-1) = G^2 d_1$, and $r_h(l) = 0$ elsewhere. Therefore,
the normalized autocorrelation is

$$\rho_h(l) = \begin{cases} 1 & l = 0 \\ \dfrac{d_1}{1 + d_1^2} & l = \pm 1 \\ 0 & |l| \ge 2 \end{cases} \tag{4.3.13}$$

The condition $-1 < d_1 < 1$ implies that $|\rho_h(1)| < \tfrac{1}{2}$ for a minimum-phase model. From
$\rho_h(1) = d_1/(1 + d_1^2)$, we obtain the quadratic equation

$$\rho_h(1) d_1^2 - d_1 + \rho_h(1) = 0 \tag{4.3.14}$$

which has the following two roots:

$$d_1 = \frac{1 \pm \sqrt{1 - 4\rho_h^2(1)}}{2\rho_h(1)} \tag{4.3.15}$$

Since the product of the roots is 1, if $d_1$ is a root, then $1/d_1$ must also be a root. Hence, only
one of these two roots can satisfy the minimum-phase condition $-1 < d_1 < 1$.
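The choice between the two roots of (4.3.15) can be made explicit in code. The following sketch takes the root with the minus sign in front of the square root, which is the one of smaller magnitude; the function name, the handling of $\rho_h(1) = 0$, and the error raised for $|\rho_h(1)| > 1/2$ are assumptions made for the illustration.

```python
import math


def az1_min_phase_coefficient(rho1):
    """Minimum-phase d_1 of an AZ(1) model with normalized autocorrelation rho_h(1)."""
    if rho1 == 0.0:
        return 0.0                       # white case: no zero to place
    if abs(rho1) > 0.5:
        raise ValueError("no real AZ(1) model has |rho_h(1)| > 1/2")
    root = math.sqrt(1.0 - 4.0 * rho1 * rho1)
    # The two roots of (4.3.14) are reciprocals; this one has magnitude <= 1.
    return (1.0 - root) / (2.0 * rho1)


def az1_rho(d1):
    """rho_h(1) = d_1 / (1 + d_1^2) from (4.3.13)."""
    return d1 / (1.0 + d1 * d1)


d_low = az1_min_phase_coefficient(az1_rho(0.5))      # model with d_1 = 0.5
d_mirror = az1_min_phase_coefficient(az1_rho(2.0))   # d_1 = 2 has the same rho_h(1)
d_high = az1_min_phase_coefficient(az1_rho(-0.95))   # model with d_1 = -0.95
```

Both $d_1 = 0.5$ and its reciprocal $d_1 = 2$ give $\rho_h(1) = 0.4$, and the sketch returns the minimum-phase value in either case.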
The spectrum is obtained by setting $z = e^{j\omega}$ in (4.3.12), or from (4.3.8)

$$R_h(e^{j\omega}) = G^2 (1 + d_1^2 + 2d_1\cos\omega) \tag{4.3.16}$$

The autocorrelation is positive definite if $R_h(e^{j\omega}) > 0$, which holds for all values of $d_1$.
Note that if $d_1 > 0$, then $\rho_h(1) > 0$ and the spectrum has low-pass behavior (see Figure
4.9), whereas a high-pass spectrum is obtained when $d_1 < 0$ (see Figure 4.10).

The first lattice parameter of the AZ(1) model is $k_1 = -\rho(1)$. The PACS can be
obtained from the Yule-Walker equations by using the autocorrelation sequence (4.3.13).
Indeed, after some algebra we obtain

$$k_m = \frac{(-d_1)^m (1 - d_1^2)}{1 - d_1^{2(m+1)}} \qquad m = 1, 2, \ldots, \infty \tag{4.3.17}$$

(see Problem 4.25). Notice the duality between the ACS and PACS of AP(1) and AZ(1)
models.

FIGURE 4.9
Sample realization of the output process, ACS, PACS, and spectrum of an AZ(1) model with $d_1 = 0.95$.

FIGURE 4.10
Sample realization of the output process, ACS, PACS, and spectrum of an AZ(1) model with
$d_1 = -0.95$.
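The closed form (4.3.17) can be compared with the PACS obtained by fitting all-pole models of increasing order to the AZ(1) autocorrelation. The recursion used below is the standard real-valued Levinson-Durbin algorithm, written from general knowledge as a stand-in for the procedures of Chapter 7; the function names are illustrative.

```python
def pacs_levinson(r, order):
    """PACS k_1..k_order of a real autocorrelation sequence via Levinson-Durbin."""
    a, k, E = [], [], r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        km = -acc / E
        a = [a[i] + km * a[m - 2 - i] for i in range(m - 1)] + [km]
        k.append(km)
        E *= 1.0 - km * km
    return k


def pacs_az1_closed_form(d1, order):
    """k_m from (4.3.17) for a real AZ(1) model with |d_1| < 1."""
    return [
        (-d1) ** m * (1.0 - d1 ** 2) / (1.0 - d1 ** (2 * (m + 1)))
        for m in range(1, order + 1)
    ]


d1, order = 0.5, 6
r = [1.0 + d1 ** 2, d1] + [0.0] * (order - 1)   # r_h(l) for G = 1, l = 0..order
k_numeric = pacs_levinson(r, order)
k_formula = pacs_az1_closed_form(d1, order)
```

The two sequences agree, and both decay geometrically with alternating sign for $d_1 > 0$, mirroring the autocorrelation of an AP(1) model.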


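Consider now the MA(1) real-valued process $x(n)$ generated by

$$x(n) = w(n) + b\,w(n-1)$$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. Using $R_x(z) = \sigma_w^2 H(z) H(z^{-1})$, we obtain the PSD function

$$R_x(e^{j\omega}) = \sigma_w^2 (1 + b^2 + 2b\cos\omega)$$

which has low-pass (high-pass) characteristics if $0 < b \le 1$ ($-1 \le b < 0$). Since
$\sigma_x^2 = (1 + b^2)\sigma_w^2$, the SFM is

$$\mathrm{SFM}_x = \frac{\sigma_w^2}{\sigma_x^2} = \frac{1}{1 + b^2} \tag{4.3.18}$$

which is maximum for $b = 0$ (white noise). The correlation matrix is banded Toeplitz (only
a number of diagonals close to the main diagonal are nonzero)

$$\mathbf{R}_x = \sigma_w^2 (1 + b^2) \begin{bmatrix} 1 & \frac{b}{1+b^2} & 0 & \cdots & 0 \\ \frac{b}{1+b^2} & 1 & \frac{b}{1+b^2} & \cdots & 0 \\ 0 & \frac{b}{1+b^2} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \tag{4.3.19}$$

and its eigenvalues and eigenvectors are given by $\lambda_k = R_x(e^{j\omega_k})$, $q_n^{(k)} = \sin\omega_k n$, $\omega_k =
\pi k/(M + 1)$, where $k = 1, 2, \ldots, M$ (see Problem 4.30).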
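The analytical eigenstructure can be verified by checking $\mathbf{R}_x \mathbf{q}^{(k)} = \lambda_k \mathbf{q}^{(k)}$ directly for a small matrix. The sketch below assumes that the eigenvector index $n$ runs from 1 to $M$; the function names and the values of $b$, $\sigma_w^2$, and $M$ are choices made for the illustration.

```python
import math


def ma1_corr_matrix(b, var_w, M):
    """M x M correlation matrix of x(n) = w(n) + b w(n-1): r_x(0) on the diagonal, r_x(1) next to it."""
    def entry(i, j):
        if i == j:
            return var_w * (1.0 + b * b)
        return var_w * b if abs(i - j) == 1 else 0.0
    return [[entry(i, j) for j in range(M)] for i in range(M)]


def ma1_eigenpair(b, var_w, M, k):
    """Analytical eigenvalue and (unnormalized) eigenvector number k, k = 1..M."""
    w_k = math.pi * k / (M + 1)
    lam = var_w * (1.0 + b * b + 2.0 * b * math.cos(w_k))   # R_x(e^{j w_k})
    q = [math.sin(w_k * n) for n in range(1, M + 1)]
    return lam, q


b, var_w, M = 0.5, 2.0, 5
R = ma1_corr_matrix(b, var_w, M)
residual = 0.0
for k in range(1, M + 1):
    lam, q = ma1_eigenpair(b, var_w, M, k)
    Rq = [sum(R[i][j] * q[j] for j in range(M)) for i in range(M)]
    residual = max(residual, max(abs(Rq[i] - lam * q[i]) for i in range(M)))
```

The residual is at rounding level for every $k$, so the eigenvalues are simply samples of the PSD at the frequencies $\omega_k$.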


The second-order all-zero model: AZ(2). Now let us consider the second-order all-zero
model. The system function of the AZ(2) model is

$$H(z) = G(1 + d_1 z^{-1} + d_2 z^{-2}) \tag{4.3.20}$$

The system is stable for all values of $d_1$ and $d_2$, and minimum-phase [see the discussion for
the AP(2) model] if

$$-1 < d_2 < 1 \qquad d_2 - d_1 > -1 \qquad d_2 + d_1 > -1 \tag{4.3.21}$$

which is a triangular region identical to that shown in Figure 4.6. The normalized
autocorrelation and the spectrum are

$$\rho_h(l) = \begin{cases} 1 & l = 0 \\ \dfrac{d_1 (1 + d_2)}{1 + d_1^2 + d_2^2} & l = \pm 1 \\ \dfrac{d_2}{1 + d_1^2 + d_2^2} & l = \pm 2 \\ 0 & |l| \ge 3 \end{cases} \tag{4.3.22}$$

and

$$R_h(e^{j\omega}) = G^2 [(1 + d_1^2 + d_2^2) + 2d_1 (1 + d_2)\cos\omega + 2d_2\cos 2\omega] \tag{4.3.23}$$

respectively.
The minimum-phase region in the autocorrelation domain is shown in Figure 4.11 and
is described by the equations

$$\rho(2) + \rho(1) = -0.5 \qquad \rho(2) - \rho(1) = -0.5 \qquad \rho^2(1) = 4\rho(2)[1 - 2\rho(2)] \tag{4.3.24}$$

derived in Problem 4.26. The formula for the PACS is quite involved. The important thing
is the duality between the ACS and the PACS of AZ(2) and AP(2) models (see Problem
4.27).

FIGURE 4.11
Minimum-phase region in the autocorrelation domain
for the AZ(2) model. Axes: $\rho(1)$ (horizontal) and $\rho(2)$ (vertical).


4.4 POLE-ZERO MODELS 


We will focus on causal pole-zero models with a recursive input-output relationship given
by

$$x(n) = -\sum_{k=1}^{P} a_k x(n-k) + \sum_{k=0}^{Q} d_k w(n-k) \tag{4.4.1}$$

where we assume that $P > 0$ and $Q \ge 1$. The models can be implemented using either
direct-form or lattice-ladder structures (Proakis and Manolakis 1996).


4.4.1 Model Properties 
In this section, we present some of the basic properties of pole-zero models. 


Impulse response. The impulse response of a causal pole-zero model can be written
in recursive form from (4.4.1) as

$$h(n) = -\sum_{k=1}^{P} a_k h(n-k) + d_n \qquad n \ge 0 \tag{4.4.2}$$

where $d_n = 0$ for $n > Q$ and $h(n) = 0$ for $n < 0$. Clearly, this formula is useful if the model
is stable. From (4.4.2), it is clear that

$$h(n) = -\sum_{k=1}^{P} a_k h(n-k) \qquad n > Q \tag{4.4.3}$$

so that the impulse response obeys the linear prediction equation for $n > Q$. Thus if we
are given $h(n)$, $0 \le n \le P + Q$, we can compute $\{a_k\}$ from (4.4.3) by using the $P$
equations specified by $Q + 1 \le n \le Q + P$. Then we can compute $\{d_k\}$ from (4.4.2), using
$0 \le n \le Q$. Therefore, the first $P + Q + 1$ values of the impulse response completely
specify the pole-zero model.
If the model is minimum-phase, the impulse response of the inverse model $h_I(n) =
\mathcal{Z}^{-1}\{A(z)/D(z)\}$, $d_0 = 1$, can be computed in a similar manner.
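The two-step recovery of $\{a_k\}$ and $\{d_k\}$ from the first $P + Q + 1$ impulse response samples can be sketched as follows for real coefficients. The function names and the Gaussian-elimination helper are introduced here for illustration, and the PZ(2, 1) test model is a made-up example.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (illustrative helper)."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def pz_impulse_response(a, d, n_samples):
    """h(n) from the recursion (4.4.2); a = [a_1..a_P], d = [d_0..d_Q]."""
    h = []
    for n in range(n_samples):
        acc = d[n] if n < len(d) else 0.0
        acc -= sum(a[k] * h[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        h.append(acc)
    return h


def pz_from_impulse_response(h, P, Q):
    """Recover ([a_1..a_P], [d_0..d_Q]) from h(0..P+Q)."""
    hv = lambda n: h[n] if n >= 0 else 0.0
    # Step 1: (4.4.3) for n = Q+1..Q+P gives P linear equations in a_1..a_P.
    rows = [[hv(n - k) for k in range(1, P + 1)] for n in range(Q + 1, Q + P + 1)]
    a = solve_linear(rows, [-h[n] for n in range(Q + 1, Q + P + 1)])
    # Step 2: (4.4.2) for n = 0..Q gives d_n = h(n) + sum_k a_k h(n-k).
    d = [h[n] + sum(a[k - 1] * hv(n - k) for k in range(1, P + 1)) for n in range(Q + 1)]
    return a, d


h = pz_impulse_response([-0.5, 0.5], [1.0, 0.5], 4)   # hypothetical PZ(2, 1) model
a_rec, d_rec = pz_from_impulse_response(h, P=2, Q=1)
```

For this model the four samples $h(0), \ldots, h(3)$ are sufficient, and the recovered parameters match the ones used to generate them.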


Autocorrelation. The complex spectrum of $H(z)$ is given by

$$R_h(z) = H(z) H^*\!\left(\frac{1}{z^*}\right) = \frac{D(z) D^*(1/z^*)}{A(z) A^*(1/z^*)} \triangleq \frac{R_d(z)}{R_a(z)} \tag{4.4.4}$$

where $R_d(z)$ and $R_a(z)$ are both finite two-sided polynomials. In a manner similar to the
all-pole case, we can write a recursive relation between the autocorrelation, impulse
response, and parameters of the model. Indeed, from (4.4.4) we obtain

$$A(z) R_h(z) = D(z) H^*\!\left(\frac{1}{z^*}\right) \tag{4.4.5}$$

Taking the inverse z-transform of (4.4.5) and noting that the inverse z-transform of $H^*(1/z^*)$
is $h^*(-n)$, we have

$$\sum_{k=0}^{P} a_k r_h(l-k) = \sum_{k=0}^{Q} d_k h^*(k-l) \qquad \text{for all } l \tag{4.4.6}$$

Since $h(n)$ is causal, we see that the right-hand side of (4.4.6) is zero for $l > Q$:

$$\sum_{k=0}^{P} a_k r_h(l-k) = 0 \qquad l > Q \tag{4.4.7}$$

Therefore, the autocorrelation of a pole-zero model obeys the linear prediction equation for
$l > Q$.

Because the impulse response $h(n)$ is a function of $a_k$ and $d_k$, the set of equations
in (4.4.6) is nonlinear in terms of parameters $a_k$ and $d_k$. However, (4.4.7) is linear in $a_k$;
therefore, we can compute $\{a_k\}$ from (4.4.7), using the set of equations for $l = Q + 1, \ldots, Q + P$,
which can be written in matrix form as

$$\begin{bmatrix} r_h(Q) & r_h(Q-1) & \cdots & r_h(Q-P+1) \\ r_h(Q+1) & r_h(Q) & \cdots & r_h(Q-P+2) \\ \vdots & \vdots & \ddots & \vdots \\ r_h(Q+P-1) & r_h(Q+P-2) & \cdots & r_h(Q) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_P \end{bmatrix} = -\begin{bmatrix} r_h(Q+1) \\ r_h(Q+2) \\ \vdots \\ r_h(Q+P) \end{bmatrix} \tag{4.4.8}$$

or

$$\bar{\mathbf{R}}_h \mathbf{a} = -\bar{\mathbf{r}}_h \tag{4.4.9}$$

Here, $\bar{\mathbf{R}}_h$ is a non-Hermitian Toeplitz matrix, and the linear system (4.4.8) can be solved
by using the algorithm of Trench (Trench 1964; Carayannis et al. 1981).

Even after we solve for $\mathbf{a}$, (4.4.6) continues to be nonlinear in $d_k$. To compute $d_k$, we
use (4.4.4) to find $R_d(z)$

$$R_d(z) = R_a(z) R_h(z) \tag{4.4.10}$$

where the coefficients of $R_a(z)$ are given by

$$r_a(l) = \sum_{k=k_1}^{k_2} a_k^* a_{k+l} \qquad -P \le l \le P, \qquad k_1 = \begin{cases} 0 & l \ge 0 \\ -l & l < 0 \end{cases} \qquad k_2 = \begin{cases} P - l & l \ge 0 \\ P & l < 0 \end{cases} \tag{4.4.11}$$

From (4.4.10), $r_d(l)$ is the convolution of $r_a(l)$ with $r_h(l)$, given by

$$r_d(l) = \sum_{k=-P}^{P} r_a(k) r_h(l-k) \tag{4.4.12}$$

If $r_h(l)$ was originally the autocorrelation of a PZ($P$, $Q$) model, then $r_d(l)$ in (4.4.12) will
be zero for $|l| > Q$. Since $R_d(z)$ is specified, it can be factored into the product of two
polynomials $D(z)$ and $D^*(1/z^*)$, where $D(z)$ is minimum-phase, as shown in Section 2.4.

Therefore, we have seen that, given the values of the autocorrelation $r_h(l)$ of a PZ($P$, $Q$)
model in the range $0 \le l \le P + Q$, we can compute the values of the parameters $\{a_k\}$ and
$\{d_k\}$ such that $H(z)$ is minimum-phase. Now, given the parameters of a pole-zero model,
we can compute its autocorrelation as follows. Equation (4.4.4) can be written as

$$R_h(z) = R_a^{-1}(z) R_d(z) \tag{4.4.13}$$

where $R_a^{-1}(z)$ is the spectrum of the all-pole model $1/A(z)$, that is, $1/R_a(z)$. The
coefficients of $R_a^{-1}(z)$ can be computed from $\{a_k\}$ by using (4.2.20) and (4.2.18). The coefficients
of $R_d(z)$ are computed from (4.3.8). Then $R_h(z)$ is the convolution of the two
autocorrelations thus computed, which is equivalent to multiplying the two polynomials in (4.4.13)
and equating equal powers of $z$ on both sides of the equation. Since $R_d(z)$ is finite, the
summations used to obtain the coefficients of $R_h(z)$ are also finite.


EXAMPLE 4.4.1. Consider a signal that has autocorrelation values of $r_h(0) = 19$, $r_h(1) = 9$,
$r_h(2) = -5$, and $r_h(3) = -7$. The parameters of the PZ(2, 1) model are found in the following
manner. First form the equation from (4.4.8)

$$\begin{bmatrix} 9 & 19 \\ -5 & 9 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 7 \end{bmatrix}$$

which yields $a_1 = -\tfrac{1}{2}$, $a_2 = \tfrac{1}{2}$. Then we compute the coefficients from (4.4.11), $r_a(0) = \tfrac{3}{2}$,
$r_a(\pm 1) = -\tfrac{3}{4}$, and $r_a(\pm 2) = \tfrac{1}{2}$. Computing the convolution in (4.4.12) for $l \le Q = 1$, we
obtain the following polynomial:

$$R_d(z) = 4z + 10 + 4z^{-1} = 4\left(1 + \frac{z^{-1}}{2}\right)(z + 2)$$

Therefore, $D(z)$ is obtained by taking the causal part, that is, $D(z) = 2[1 + z^{-1}/2]$, and
$d_1 = \tfrac{1}{2}$.
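The steps of Example 4.4.1 can be retraced numerically. The following is a minimal sketch in plain Python, assuming real coefficients and $a_0 = 1$; the helper names `rh`, `ra`, and `rd` are illustrative and not part of the text. It solves the 2 x 2 system by Cramer's rule, forms $r_a(l)$ from (4.4.11), convolves as in (4.4.12), and picks the root of the resulting quadratic that lies inside the unit circle.

```python
import math

# Autocorrelation values from Example 4.4.1; r_h(-l) = r_h(l) for a real model.
r = {0: 19.0, 1: 9.0, 2: -5.0, 3: -7.0}
P, Q = 2, 1


def rh(l):
    """Symmetric lookup of r_h(l)."""
    return r[abs(l)]


# Step 1: solve [r(1) r(0); r(2) r(1)] [a1; a2] = [-r(2); -r(3)] by Cramer's rule.
det = rh(1) * rh(1) - rh(0) * rh(2)
a1 = (-rh(2) * rh(1) + rh(0) * rh(3)) / det
a2 = (-rh(1) * rh(3) + rh(2) * rh(2)) / det
a = [1.0, a1, a2]  # a_0 = 1 is assumed


def ra(l):
    """r_a(l) from (4.4.11), specialized to real coefficients."""
    l = abs(l)
    if l > P:
        return 0.0
    return sum(a[k] * a[k + l] for k in range(P - l + 1))


def rd(l):
    """r_d(l) from the convolution (4.4.12); valid here for l = 0, 1."""
    return sum(ra(k) * rh(l - k) for k in range(-P, P + 1))


rd0, rd1 = rd(0), rd(1)

# Step 3: R_d(z) = rd1*z + rd0 + rd1/z is proportional to (1 + d1/z)(1 + d1*z),
# so d1 solves rd1*d1^2 - rd0*d1 + rd1 = 0. Keep the root inside the unit circle.
disc = math.sqrt(rd0 * rd0 - 4.0 * rd1 * rd1)
d1 = (rd0 - disc) / (2.0 * rd1)

print(a1, a2, rd0, rd1, d1)
```

For larger $Q$ the quadratic step would have to be replaced by a general spectral factorization; the sketch only covers the $Q = 1$ case of the example.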


Spectrum. The spectrum of $H(z)$ is given by

$$R_h(e^{j\omega}) = |H(e^{j\omega})|^2 = \frac{|D(e^{j\omega})|^2}{|A(e^{j\omega})|^2} \tag{4.4.14}$$

Therefore, $R_h(e^{j\omega})$ can be obtained by dividing the spectrum of $D(z)$ by the spectrum of
$A(z)$. Again, the FFT can be used to advantage in computing the numerator and denominator
of (4.4.14). If the spectrum $R_h(e^{j\omega})$ of a PZ(P, Q) model is given, then the parameters of
the (minimum-phase) model can be recovered by first computing the autocorrelation $r_h(l)$
as the inverse Fourier transform of $R_h(e^{j\omega})$ and then using the procedure outlined in the
previous section to compute the sets of coefficients $\{a_k\}$ and $\{d_k\}$.
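As a small illustration of (4.4.14), the sketch below evaluates the numerator and denominator directly at a single frequency instead of using an FFT. The function name `pz_spectrum` and the coefficient lists (taken from Example 4.4.1 with unit gain) are assumptions made for this example only.

```python
import cmath
import math


def pz_spectrum(d, a, omega):
    """Evaluate |D(e^{jw})|^2 / |A(e^{jw})|^2 for coefficient lists d and a.

    d = [d_0, d_1, ...] and a = [a_0, a_1, ...] are the coefficients of
    z^0, z^-1, ... in D(z) and A(z).
    """
    z_inv = cmath.exp(-1j * omega)
    D = sum(dk * z_inv ** k for k, dk in enumerate(d))
    A = sum(ak * z_inv ** k for k, ak in enumerate(a))
    return abs(D) ** 2 / abs(A) ** 2


d = [1.0, 0.5]        # D(z) = 1 + 0.5 z^-1
a = [1.0, -0.5, 0.5]  # A(z) = 1 - 0.5 z^-1 + 0.5 z^-2

spectrum_at_dc = pz_spectrum(d, a, 0.0)
spectrum_at_pi = pz_spectrum(d, a, math.pi)
print(spectrum_at_dc, spectrum_at_pi)
```

Evaluating the two polynomials on a dense uniform grid of frequencies is exactly what zero-padded FFTs of `d` and `a` would deliver in one pass.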


Partial autocorrelation and lattice-ladder structures. Since a PZ(P, Q) model is
equivalent to an AP(∞) model, its PACS has infinite extent and behaves, after a certain lag,
as the PACS of an all-zero model.


4.4.2 Autoregressive Moving-Average Models 


The autoregressive moving-average model is a PZ(P, Q) model driven by white noise and
is denoted by ARMA(P, Q). Again, we set $d_0 = 1$ and incorporate the gain into the variance
(power) of the white noise excitation. Hence, a causal ARMA(P, Q) model is defined by

$$x(n) = -\sum_{k=1}^{P} a_k x(n-k) + w(n) + \sum_{k=1}^{Q} d_k w(n-k) \tag{4.4.15}$$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. The ARMA(P, Q) model parameters are
$\{\sigma_w^2, a_1, \ldots, a_P, d_1, \ldots, d_Q\}$. The output has zero mean and variance of

$$\sigma_x^2 = -\sum_{k=1}^{P} a_k r_x(k) + \sigma_w^2 \left[ 1 + \sum_{k=1}^{Q} d_k h(k) \right] \tag{4.4.16}$$

where $h(n)$ is the impulse response of the model. The presence of $h(n)$ in (4.4.16) makes
the dependence of $\sigma_x^2$ on the model parameters highly nonlinear. The autocorrelation of
$x(n)$ is given by

$$\sum_{k=0}^{P} a_k r_x(l-k) = \sigma_w^2 \left[ h(-l) + \sum_{k=1}^{Q} d_k h(k-l) \right] \quad \text{for all } l \tag{4.4.17}$$

which reduces to (4.4.16) for $l = 0$ because $h(0) = 1$, and the power spectrum by

$$R_x(e^{j\omega}) = \sigma_w^2\, \frac{|D(e^{j\omega})|^2}{|A(e^{j\omega})|^2} \tag{4.4.18}$$

The significance of ARMA(P, Q) models is that they can provide more accurate representations
than AR or MA models with the same number of parameters. The ARMA model
is able to combine the spectral peak matching of the AR model with the ability of the MA
model to place nulls in the spectrum.
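The variance relation (4.4.16) can be checked numerically. The sketch below is an illustration under stated assumptions: the ARMA(2, 1) coefficients are hypothetical, $\sigma_w^2 = 1$, and the autocorrelation is approximated by a truncated sum $r_x(l) = \sigma_w^2 \sum_n h(n) h(n+l)$, which is adequate for a stable model whose impulse response decays quickly.

```python
def arma_impulse_response(a, d, n_samples):
    """Impulse response of (4.4.15): a = [a_1..a_P], d = [d_1..d_Q], d_0 = 1."""
    h = []
    for n in range(n_samples):
        value = 1.0 if n == 0 else 0.0           # w(n) term for a unit impulse
        if 1 <= n <= len(d):
            value += d[n - 1]                    # d_n * w(0)
        for k in range(1, min(n, len(a)) + 1):
            value -= a[k - 1] * h[n - k]         # -a_k * x(n - k)
        h.append(value)
    return h


def autocorr(h, lag, var_w=1.0):
    """Truncated-sum approximation of r_x(lag) for white-noise variance var_w."""
    return var_w * sum(h[n] * h[n + lag] for n in range(len(h) - lag))


a = [-0.5, 0.5]   # hypothetical a_1, a_2 (poles at radius sqrt(0.5))
d = [0.5]         # hypothetical d_1
h = arma_impulse_response(a, d, 400)

# Left side: variance computed directly as r_x(0).
var_direct = autocorr(h, 0)

# Right side of (4.4.16) with sigma_w^2 = 1.
var_formula = (
    -sum(ak * autocorr(h, k) for k, ak in enumerate(a, start=1))
    + 1.0
    + sum(dk * h[k] for k, dk in enumerate(d, start=1))
)
print(var_direct, var_formula)
```

Both quantities depend on $h(n)$, which is itself a nonlinear function of the coefficients, so the check also makes the nonlinearity noted in the text concrete.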


4.4.3 The First-Order Pole-Zero Model: PZ(1, 1) 


Consider the PZ(1, 1) model with the following system function

$$H(z) = G\,\frac{1 + d_1 z^{-1}}{1 + a_1 z^{-1}} \tag{4.4.19}$$

where $d_1$ and $a_1$ are real coefficients. The model is minimum-phase if

$$-1 < d_1 < 1 \qquad -1 < a_1 < 1 \tag{4.4.20}$$

which correspond to the rectangular region shown in Figure 4.12(a).

FIGURE 4.12
Minimum-phase and positive definiteness regions for the PZ(1, 1) model in the
(a) $(d_1, a_1)$ space and (b) $(\rho(1), \rho(2))$ space.


For the minimum-phase case, the impulse responses of the direct and the inverse models
are

$$h(n) = \mathcal{Z}^{-1}\{H(z)\} = \begin{cases} 0 & n < 0 \\ G & n = 0 \\ G(-a_1)^{n-1}(d_1 - a_1) & n > 0 \end{cases} \tag{4.4.21}$$

and

$$h_I(n) = \mathcal{Z}^{-1}\{H_I(z)\} = \begin{cases} 0 & n < 0 \\ G^{-1} & n = 0 \\ G^{-1}(-d_1)^{n-1}(a_1 - d_1) & n > 0 \end{cases} \tag{4.4.22}$$

respectively. We note that as the pole $p = -a_1$ gets closer to the unit circle, the impulse
response decays more slowly and the model has "longer memory." The zero $z = -d_1$
controls the impulse response of the inverse model in a similar way. The PZ(1, 1) model is
equivalent to the AZ(∞) model

$$x(n) = G w(n) + G \sum_{k=1}^{\infty} h(k) w(n-k) \tag{4.4.23}$$

or the AP(∞) model

$$x(n) = -\sum_{k=1}^{\infty} h_I(k) x(n-k) + G w(n) \tag{4.4.24}$$

If we wish to approximate the PZ(1, 1) model with a finite-order AZ(Q) model, the order Q
required to achieve a certain accuracy increases as the pole moves closer to the unit circle.
Likewise, in the case of an AP(P) approximation, better fits to the PZ(P, Q) model require
an increased order P as the zero moves closer to the unit circle.

To determine the autocorrelation, we recall from (4.4.6) that for a causal model

$$r_h(l) = -a_1 r_h(l-1) + G h(-l) + G d_1 h(1-l) \qquad \text{all } l \tag{4.4.25}$$

or

$$\begin{aligned} r_h(0) &= -a_1 r_h(1) + G^2 + G^2 d_1 (d_1 - a_1) \\ r_h(1) &= -a_1 r_h(0) + G^2 d_1 \\ r_h(l) &= -a_1 r_h(l-1) \qquad l \ge 2 \end{aligned} \tag{4.4.26}$$

Solving the first two equations for $r_h(0)$ and $r_h(1)$, we obtain

$$r_h(0) = G^2\, \frac{1 + d_1^2 - 2 a_1 d_1}{1 - a_1^2} \tag{4.4.27}$$

and

$$r_h(1) = G^2\, \frac{(d_1 - a_1)(1 - a_1 d_1)}{1 - a_1^2} \tag{4.4.28}$$

The normalized autocorrelation is given by

$$\rho_h(1) = \frac{(d_1 - a_1)(1 - a_1 d_1)}{1 + d_1^2 - 2 a_1 d_1} \tag{4.4.29}$$

and

$$\rho_h(l) = (-a_1)\, \rho_h(l-1) \qquad l \ge 2 \tag{4.4.30}$$

Note that given $\rho_h(1)$ and $\rho_h(2)$, we have a nonlinear system of equations that must be
solved to obtain $a_1$ and $d_1$. By using Equations (4.4.20), (4.4.29), and (4.4.30), it can be
shown (see Problem 4.28) that the PZ(1, 1) is minimum-phase if the ACS satisfies the
conditions

$$\begin{aligned} |\rho(2)| &< |\rho(1)| \\ \rho(2) &> \rho(1)[2\rho(1) + 1] \qquad \rho(1) < 0 \\ \rho(2) &> \rho(1)[2\rho(1) - 1] \qquad \rho(1) > 0 \end{aligned} \tag{4.4.31}$$

which correspond to the admissible region shown in Figure 4.12(b).
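The closed-form results for the PZ(1, 1) model can be cross-checked against a direct summation of the impulse response. The sketch below is illustrative only: the values of $G$, $a_1$, and $d_1$ are hypothetical, the function names are not from the text, and the autocorrelation is approximated by the truncated sum $r_h(l) \approx \sum_n h(n) h(n+l)$.

```python
def pz11_impulse_response(G, a1, d1, n_samples):
    """h(n) for n = 0..n_samples-1 from (4.4.21)."""
    return [G if n == 0 else G * (-a1) ** (n - 1) * (d1 - a1)
            for n in range(n_samples)]


def pz11_autocorr(G, a1, d1):
    """r_h(0) and r_h(1) from (4.4.27) and (4.4.28)."""
    r0 = G ** 2 * (1 + d1 ** 2 - 2 * a1 * d1) / (1 - a1 ** 2)
    r1 = G ** 2 * (d1 - a1) * (1 - a1 * d1) / (1 - a1 ** 2)
    return r0, r1


def satisfies_admissibility(rho1, rho2):
    """Conditions (4.4.31) on the normalized autocorrelation."""
    if abs(rho2) >= abs(rho1):
        return False
    if rho1 < 0:
        return rho2 > rho1 * (2 * rho1 + 1)
    return rho2 > rho1 * (2 * rho1 - 1)


G, a1, d1 = 2.0, -0.6, 0.4   # hypothetical minimum-phase example
h = pz11_impulse_response(G, a1, d1, 300)

# Truncated-sum autocorrelation at lags 0, 1, 2.
r_sum = [sum(h[n] * h[n + l] for n in range(len(h) - l)) for l in range(3)]
r0, r1 = pz11_autocorr(G, a1, d1)

rho1 = r1 / r0
rho2 = -a1 * rho1            # (4.4.30) with l = 2
admissible = satisfies_admissibility(rho1, rho2)
print(r_sum, (r0, r1), admissible)
```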


4.4.4 Summary and Dualities

Table 4.1 summarizes the key properties of all-zero, all-pole, and pole-zero models. These
properties help to identify models for empirical discrete-time signals. Furthermore, the table
shows the duality between AZ and AP models. More specifically, we see that

1. An invertible AZ(Q) model is equivalent to an AP(∞) model. Thus, it has a finite-extent
autocorrelation and an infinite-extent partial autocorrelation.

2. A stable AP(P) model is equivalent to an AZ(∞) model. Thus, it has an infinite-extent
autocorrelation and a finite-extent partial autocorrelation.

3. The autocorrelation of an AZ(Q) model behaves as the partial autocorrelation of an
AP(P) model, and vice versa.

4. The spectra of an AP(P) model and an AZ(Q) model are related through an inverse
relationship.

TABLE 4.1
Summary of all-pole, all-zero, and pole-zero model properties

| Model | AP(P) | AZ(Q) | PZ(P, Q) |
|---|---|---|---|
| Input-output description | $x(n) + \sum_{k=1}^{P} a_k x(n-k) = w(n)$ | $x(n) = d_0 w(n) + \sum_{k=1}^{Q} d_k w(n-k)$ | $x(n) + \sum_{k=1}^{P} a_k x(n-k) = d_0 w(n) + \sum_{k=1}^{Q} d_k w(n-k)$ |
| System function | $H(z) = \dfrac{1}{A(z)} = \dfrac{d_0}{1 + \sum_{k=1}^{P} a_k z^{-k}}$ | $H(z) = D(z) = d_0 + \sum_{k=1}^{Q} d_k z^{-k}$ | $H(z) = \dfrac{D(z)}{A(z)}$ |
| Recursive representation | Finite summation | Infinite summation | Infinite summation |
| Nonrecursive representation | Infinite summation | Finite summation | Infinite summation |
| Stability conditions | Poles inside unit circle | Always | Poles inside unit circle |
| Invertibility conditions | Always | Zeros inside unit circle | Zeros inside unit circle |
| Autocorrelation sequence | Infinite duration (damped exponentials and/or sine waves); tails off | Finite duration; cuts off | Infinite duration (damped exponentials and/or sine waves after Q − P lags); tails off |
| Partial autocorrelation | Finite duration; cuts off | Infinite duration (damped exponentials and/or sine waves); tails off | Infinite duration (dominated by damped exponentials and/or sine waves after Q − P lags); tails off |
| Spectrum | Good peak matching | Good "notch" matching | Good peak and valley matching |

These dualities and properties have been shown and illustrated for low-order models
in the previous sections.


4.5 MODELS WITH POLES ON THE UNIT CIRCLE 


In this section, we show that by restricting some poles to being on the unit circle, we obtain 
models that are useful for modeling certain types of nonstationary behavior. 

Pole-zero models with poles on the unit circle are unstable. Hence, if we drive them
with stationary white noise, the generated process is nonstationary. However, as we will see
in the sequel, placing a small number of real poles at $z = 1$ or complex conjugate poles at
$z_k = e^{\pm j\theta_k}$ provides a class of models useful for modeling certain types of nonstationary
behavior. The system function of a pole-zero model with $d$ poles at $z = 1$, denoted as
PZ(P, d, Q), is

$$H(z) = \frac{D(z)}{A(z)}\, \frac{1}{(1 - z^{-1})^d} \tag{4.5.1}$$

and can be viewed as a PZ(P, Q) model, $D(z)/A(z)$, followed by a $d$th-order accumulator.
The accumulator $y(n) = y(n-1) + x(n)$ has the system function $1/(1 - z^{-1})$ and can be
thought of as a discrete-time integrator. The presence of the unit poles makes the PZ(P, d, Q)
model non-minimum-phase. Since the model is unstable, we cannot use the convolution
summation to represent it because, in practice, only finite-order approximations are possible.
This can be easily seen if we recall that the impulse response of the model PZ(0, d, 0) equals
$u(n)$ for $d = 1$ and $(n + 1)u(n)$ for $d = 2$. However, if $D(z)/A(z)$ is minimum-phase, the
inverse model $H_I(z) = 1/H(z)$ is stable, and we can use the recursive form (see Section
4.1) to represent the model. Indeed, we always use this representation when we apply this
model in practice.

The spectrum of the PZ(0, d, 0) model is

$$R_h(e^{j\omega}) = \frac{1}{[2 \sin(\omega/2)]^{2d}} \tag{4.5.2}$$

and since $R_h(e^{j0}) = \sum_{l=-\infty}^{\infty} r_h(l) = \infty$, the autocorrelation does not exist.

In the case of complex conjugate poles, the term $(1 - z^{-1})^d$ in (4.5.1) is replaced by
$(1 - 2\cos\theta_k\, z^{-1} + z^{-2})^d$, that is,

$$H(z) = \frac{D(z)}{A(z)}\, \frac{1}{(1 - 2\cos\theta_k\, z^{-1} + z^{-2})^d} \tag{4.5.3}$$

The second term is basically a cascade of AP(2) models with complex conjugate poles
on the unit circle. This model exhibits strong periodicity in its impulse response, and its
"resonance-like" spectrum diverges at $\omega = \theta_k$.

With regard to the partial autocorrelation, we recall that the presence of poles on the
unit circle results in some lattice parameters taking on the values ±1.


EXAMPLE 4.5.1. Consider the following causal PZ(1, 1, 1) model

$$H(z) = \frac{1 + d_1 z^{-1}}{1 + a_1 z^{-1}}\, \frac{1}{1 - z^{-1}} = \frac{1 + d_1 z^{-1}}{1 - (1 - a_1) z^{-1} - a_1 z^{-2}} \tag{4.5.4}$$

with $-1 < a_1 < 1$ and $-1 < d_1 < 1$.
The difference equation representation of the model uses previous values of the output and
the present and previous values of the input. It is given by

$$y(n) = (1 - a_1) y(n-1) + a_1 y(n-2) + x(n) + d_1 x(n-1) \tag{4.5.5}$$

To express the output in terms of the present and previous values of the input (nonrecursive
representation), we find the impulse response of the model

$$h(n) = \mathcal{Z}^{-1}\{H(z)\} = A_1 u(n) + A_2 (-a_1)^n u(n) \tag{4.5.6}$$

where $A_1 = (1 + d_1)/(1 + a_1)$ and $A_2 = (a_1 - d_1)/(1 + a_1)$. Note that the model is unstable,
and it cannot be approximated by an FIR system because $h(n) \to A_1 u(n)$ as $n \to \infty$.

Finally, we can express the output as a weighted sum of previous outputs and the present
input, using the impulse response of the inverse model $H_I(z) = 1/H(z)$

$$h_I(n) = \mathcal{Z}^{-1}\{H_I(z)\} = B_1 \delta(n) + B_2 \delta(n-1) + B_3 (-d_1)^n u(n) \tag{4.5.7}$$

where $B_1 = (a_1 - d_1 + a_1 d_1)/d_1^2$, $B_2 = -a_1/d_1$, and $B_3 = (-a_1 + d_1 - a_1 d_1 + d_1^2)/d_1^2$. Since
$-1 < d_1 < 1$, the sequence $h_I(n)$ decays at a rate governed by the value of $d_1$. If $h_I(n) \approx 0$
for $n > p_d$, the recursive formula

$$y(n) = -\sum_{k=1}^{p_d} h_I(k)\, y(n-k) + x(n) \tag{4.5.8}$$

provides a good representation of the PZ(1, 1, 1) model. For example, if $a_1 = 0.3$ and $d_1 = 0.5$,
we find that $|h_I(n)| < 0.0001$ for $n > 12$, which means that the current value of the model
output can be computed with sufficient accuracy from the 12 most recent values of signal $y(n)$.
This is illustrated in Figure 4.13, which also shows a realization of the output process if the
model is driven by white Gaussian noise.
FIGURE 4.13
Sample realization of the output process, impulse response, impulse response of the inverse
model, and spectrum of a PZ(1, 1, 1) model with $a_1 = 0.3$, $d_1 = 0.5$, and $d = 1$. The value
$R(e^{j0}) = \infty$ is not plotted. [Four panels: sample realization, inverse model $h_I(n)$, and direct
model $h(n)$, each plotted as amplitude versus sample number; spectrum plotted as amplitude
versus frequency (cycles/sampling interval).]
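The closed forms (4.5.6) and (4.5.7) of Example 4.5.1 can be compared with impulse responses generated recursively. The following is a minimal sketch with the example's values $a_1 = 0.3$ and $d_1 = 0.5$; the function names are illustrative assumptions, and the inverse model is written as $H_I(z) = (1 - z^{-1})(1 + a_1 z^{-1})/(1 + d_1 z^{-1})$, which follows from (4.5.4).

```python
a1, d1 = 0.3, 0.5  # values used in Example 4.5.1


def direct_recursive(n_samples):
    """Impulse response of the model from the difference equation (4.5.5)."""
    y = []
    for n in range(n_samples):
        x_n = 1.0 if n == 0 else 0.0       # unit impulse x(n)
        x_prev = 1.0 if n == 1 else 0.0    # x(n - 1)
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append((1 - a1) * y1 + a1 * y2 + x_n + d1 * x_prev)
    return y


def direct_closed_form(n):
    """h(n) from (4.5.6) for n >= 0."""
    A1 = (1 + d1) / (1 + a1)
    A2 = (a1 - d1) / (1 + a1)
    return A1 + A2 * (-a1) ** n


def inverse_recursive(n_samples):
    """Impulse response of H_I(z): g(n) = b(n) - d1 * g(n - 1)."""
    b = [1.0, -(1 - a1), -a1]              # numerator 1 - (1 - a1) z^-1 - a1 z^-2
    g = []
    for n in range(n_samples):
        b_n = b[n] if n < len(b) else 0.0
        g_prev = g[n - 1] if n >= 1 else 0.0
        g.append(b_n - d1 * g_prev)
    return g


def inverse_closed_form(n):
    """h_I(n) from (4.5.7) for n >= 0."""
    B1 = (a1 - d1 + a1 * d1) / d1 ** 2
    B2 = -a1 / d1
    B3 = (-a1 + d1 - a1 * d1 + d1 ** 2) / d1 ** 2
    delta0 = 1.0 if n == 0 else 0.0
    delta1 = 1.0 if n == 1 else 0.0
    return B1 * delta0 + B2 * delta1 + B3 * (-d1) ** n


h = direct_recursive(30)
h_inv = inverse_recursive(30)
print(h[:4], h_inv[:4])
```

The direct response settles at $A_1$ rather than decaying, while the inverse response shrinks geometrically with ratio $|d_1|$, which is why only the recursive form (4.5.8) is practical.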


Autoregressive integrated moving-average models. In Section 3.3.2 we discussed
discrete-time random signals with stationary increments. Clearly, driving a PZ(P, d, Q)
model with white noise generates a random signal whose $d$th difference is a stationary
ARMA(P, Q) process. Such time series are known in the statistical literature as autoregressive
integrated moving-average models, denoted ARIMA(P, d, Q). They are useful in
modeling signals with certain stochastic trends (e.g., random changes in the level and slope
of the signal). Indeed, many empirical signals (e.g., infrared background measurements and
stock prices) exhibit this type of behavior (see Figure 1.6). Notice that the ARIMA(0, 1, 0)
process, that is, $x(n) = x(n-1) + w(n)$, where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$, is the discrete-time
equivalent of the random walk or Brownian motion process (Papoulis 1991).
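The relation between an integrated process and its $d$th difference can be illustrated with a short simulation. This sketch assumes zero initial conditions and a unit-variance Gaussian input from Python's `random` module; the helper names `integrate` and `difference` are hypothetical. It generates an ARIMA(0, 1, 0) random walk and confirms that differencing returns the white-noise input.

```python
import random


def integrate(w, d=1):
    """Pass w(n) through d cascaded accumulators y(n) = y(n-1) + x(n)."""
    x = list(w)
    for _ in range(d):
        total, out = 0.0, []
        for value in x:
            total += value
            out.append(total)
        x = out
    return x


def difference(x, d=1):
    """Apply the filter (1 - z^-1)^d, assuming zero initial conditions."""
    y = list(x)
    for _ in range(d):
        y = [y[n] - (y[n - 1] if n > 0 else 0.0) for n in range(len(y))]
    return y


rng = random.Random(0)                      # fixed seed for repeatability
w = [rng.gauss(0.0, 1.0) for _ in range(200)]

walk = integrate(w, d=1)                    # x(n) = x(n-1) + w(n)
recovered = difference(walk, d=1)           # first difference of the walk

ramp_like = integrate(w, d=2)               # two unit poles
recovered2 = difference(ramp_like, d=2)     # second difference
```

The same `difference` step is what one would apply to measured data before fitting a stationary ARMA(P, Q) model to the result.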

When the unit poles are complex conjugate, the model is known as a harmonic PZ 
model. This model produces random sequences that exhibit “random periodic behavior” and 
are known as seasonal time series in the statistical literature. Such signals repeat themselves 
cycle by cycle, but there is some randomness in both the length and the pattern of each cycle. 
The identification and estimation of ARIMA and seasonal models and their applications can 
be found in Box, Jenkins, and Reinsel (1994); Brockwell and Davis (1991); and Hamilton 
(1994). 


4.6 CEPSTRUM OF POLE-ZERO MODELS 


In this section we determine the cepstrum of pole-zero models and its properties, and 
we develop algorithms to convert between direct structure model parameters and cepstral 
coefficients. The cepstrum has been proved a valuable tool in speech coding and recognition 
applications and has been extensively studied in the corresponding literature (Rabiner and 
Schafer 1978; Rabiner and Juang 1993; Furui 1989). For simplicity, we consider models 
with real coefficients. 


4.6.1 Pole-Zero Models 


The cepstrum of the impulse response $h(n)$ of a pole-zero model is the inverse z-transform
of

$$\log H(z) = \log D(z) - \log A(z) \tag{4.6.1}$$

$$= \log d_0 + \sum_{i=1}^{Q} \log(1 - z_i z^{-1}) - \sum_{i=1}^{P} \log(1 - p_i z^{-1}) \tag{4.6.2}$$

where $\{z_i\}$ and $\{p_i\}$ are the zeros and poles of $H(z)$, respectively. If we assume that $H(z)$
is minimum-phase and use the power series expansion

$$\log(1 - \alpha z^{-1}) = -\sum_{n=1}^{\infty} \frac{\alpha^n}{n} z^{-n} \qquad |z| > |\alpha|$$

we find that the cepstrum $c(n)$ is given by

$$c(n) = \begin{cases} 0 & n < 0 \\ \log d_0 & n = 0 \\ \dfrac{1}{n}\left[ \sum_{i=1}^{P} p_i^n - \sum_{i=1}^{Q} z_i^n \right] & n > 0 \end{cases} \tag{4.6.3}$$

Since the poles and zeros are assumed to be inside the unit circle, (4.6.3) implies that $c(n)$
is bounded by

$$-\frac{P + Q}{n} \le c(n) \le \frac{P + Q}{n} \tag{4.6.4}$$

with equality if and only if all the roots are appropriately at $z = 1$ or $z = -1$.
If $H(z)$ is minimum-phase, then there exists a unique mapping between the cepstrum
and the impulse response, given by the recursive relations (Oppenheim and Schafer 1989)

$$c(0) = \log h(0) = \log d_0 \qquad c(n) = \frac{h(n)}{h(0)} - \frac{1}{n} \sum_{m=0}^{n-1} m\, c(m)\, \frac{h(n-m)}{h(0)} \quad n > 0 \tag{4.6.5}$$

and

$$h(0) = e^{c(0)} \qquad h(n) = h(0)\, c(n) + \frac{1}{n} \sum_{m=0}^{n-1} m\, c(m)\, h(n-m) \quad n > 0 \tag{4.6.6}$$

where we have assumed $d_0 > 0$ without loss of generality. Therefore, given the cepstrum
$c(n)$ in the range $0 \le n \le P + Q$, we can completely recover the parameters of the pole-zero
model as follows. From (4.6.6) we can compute $h(n)$, $0 \le n \le P + Q$, and from (4.4.2)
and (4.4.3) we can recover $\{a_k\}$ and $\{d_k\}$.
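The mapping between cepstrum and impulse response in (4.6.5) and (4.6.6) can be exercised on the PZ(1, 1) model of Section 4.4.3. The sketch below is illustrative: it assumes real poles and zeros inside the unit circle, the parameter values are hypothetical, and the function names are not from the text. It builds $c(n)$ from (4.6.3), recovers $h(n)$ with (4.6.6), and maps back with (4.6.5).

```python
import math


def pz_cepstrum(d0, zeros, poles, n_max):
    """c(0..n_max) from (4.6.3) for real zeros and poles inside the unit circle."""
    c = [math.log(d0)]
    for n in range(1, n_max + 1):
        c.append((sum(p ** n for p in poles) - sum(z ** n for z in zeros)) / n)
    return c


def cepstrum_to_impulse(c):
    """h(n) from c(n) via the recursion (4.6.6)."""
    h = [math.exp(c[0])]
    for n in range(1, len(c)):
        acc = sum(m * c[m] * h[n - m] for m in range(1, n))  # the m = 0 term is zero
        h.append(h[0] * c[n] + acc / n)
    return h


def impulse_to_cepstrum(h):
    """c(n) from h(n) via the recursion (4.6.5)."""
    c = [math.log(h[0])]
    for n in range(1, len(h)):
        acc = sum(m * c[m] * h[n - m] for m in range(1, n))
        c.append(h[n] / h[0] - acc / (n * h[0]))
    return c


# Hypothetical PZ(1, 1) model: H(z) = G (1 + d1 z^-1) / (1 + a1 z^-1).
G, a1, d1 = 2.0, -0.6, 0.4
c = pz_cepstrum(G, zeros=[-d1], poles=[-a1], n_max=10)
h = cepstrum_to_impulse(c)

# Reference impulse response from (4.4.21).
h_ref = [G] + [G * (-a1) ** (n - 1) * (d1 - a1) for n in range(1, 11)]
c_back = impulse_to_cepstrum(h)
print(h[:3], h_ref[:3])
```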


4.6.2 All-Pole Models 


The cepstrum of a minimum-phase all-pole model is given by (4.6.2) and (4.6.3) with
$Q = 0$. Since $H(z)$ is minimum-phase, the cepstrum $c(n)$ of $1/A(z)$ is simply the negative
of the cepstrum of $A(z)$, which can be written in terms of $a_k$ (see also Problem 4.34). As a
result, the cepstrum can be obtained from the direct-form coefficients by using the following
recursion

$$c(n) = \begin{cases} -a_n - \dfrac{1}{n} \sum_{k=1}^{n-1} (n - k)\, a_k\, c(n-k) & 1 \le n \le P \\ -\dfrac{1}{n} \sum_{k=1}^{P} (n - k)\, a_k\, c(n-k) & n > P \end{cases} \tag{4.6.7}$$

The inverse relation is

$$a_n = -c(n) - \frac{1}{n} \sum_{k=1}^{n-1} (n - k)\, a_k\, c(n-k) \qquad n > 0 \tag{4.6.8}$$

which shows that the first P cepstral coefficients completely determine the model parameters
(Furui 1981).
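The recursions (4.6.7) and (4.6.8) translate almost line by line into code. The following is a minimal sketch, assuming real coefficients passed as a list `a = [a_1, ..., a_P]`; the function names and the test model (a complex conjugate pole pair with hypothetical radius 0.9 and angle $\pi/3$) are illustrative. For such a pair the result can be compared with the closed form $(2/n)\, r^n \cos n\theta$.

```python
import math


def ap_cepstrum(a, n_max, d0=1.0):
    """c(0..n_max) of d0 / A(z) via (4.6.7); a = [a_1, ..., a_P]."""
    P = len(a)
    c = [math.log(d0)]
    for n in range(1, n_max + 1):
        upper = min(n - 1, P)              # n - 1 terms while n <= P, then P terms
        acc = sum((n - k) * a[k - 1] * c[n - k] for k in range(1, upper + 1))
        a_n = a[n - 1] if n <= P else 0.0
        c.append(-a_n - acc / n)
    return c


def cepstrum_to_ap(c, P):
    """a_1..a_P from c(1..P) via the inverse relation (4.6.8)."""
    a = []
    for n in range(1, P + 1):
        acc = sum((n - k) * a[k - 1] * c[n - k] for k in range(1, n))
        a.append(-c[n] - acc / n)
    return a


# Hypothetical AP(2) model with poles r * exp(+-j*theta).
r, theta = 0.9, math.pi / 3
a = [-2.0 * r * math.cos(theta), r * r]

c = ap_cepstrum(a, 20)
c_closed = [0.0] + [2.0 / n * r ** n * math.cos(n * theta) for n in range(1, 21)]
a_back = cepstrum_to_ap(c, 2)
print(c[:3], a_back)
```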

From (4.6.7) it is evident that the cepstrum generally decays as $1/n$. Therefore, it may
be desirable sometimes to consider

$$c'(n) = n\, c(n) \tag{4.6.9}$$

which is known as the ramp cepstrum since it is obtained by multiplying the cepstrum by
a ramp function. From (4.6.9) and (4.6.4), we note that the ramp cepstrum of an AP(P)
model is bounded by

$$|c'(n)| \le P \qquad n > 0 \tag{4.6.10}$$

with equality if and only if all the poles are at $z = 1$ or $z = -1$. Also $c'(n)$ is equal to
the negative of the inverse z-transform of the derivative of $\log H(z)$. From the preceding
equations, we can write

$$c'(n) = -n a_n - \sum_{k=1}^{n-1} a_k\, c'(n-k) \qquad 1 \le n \le P \tag{4.6.11}$$

$$c'(n) = -\sum_{k=1}^{P} a_k\, c'(n-k) \qquad n > P \tag{4.6.12}$$

and

$$a_n = -\frac{1}{n} \left[ c'(n) + \sum_{k=1}^{n-1} a_k\, c'(n-k) \right] \qquad n > 0 \tag{4.6.13}$$

It is evident that the first P values of $c'(n)$, $1 \le n \le P$, completely specify the model
coefficients. However, since $c'(0) = 0$, the information about the gain $d_0$ is lost in the ramp
cepstrum. Equation (4.6.12) for $n > P$ is reminiscent of similar equations for the impulse
response in (4.2.5) and the autocorrelation in (4.2.18), with the major difference that for the
ramp cepstrum the relation is only true for $n > P$, while for the impulse response and the
autocorrelation, the relations are true for $n > 0$ and $k > 0$, respectively.

Since $R(z) = H(z)H(z^{-1})$, we have

$$\log R(z) = \log H(z) + \log H(z^{-1}) \tag{4.6.14}$$

and if $c_r(n)$ is the real cepstrum of $R(e^{j\omega})$, we conclude that

$$c_r(n) = c(n) + c(-n) \tag{4.6.15}$$

For minimum-phase $H(z)$, $c(n) = 0$ for $n < 0$. Therefore,

$$c_r(n) = \begin{cases} c(-n) & n < 0 \\ 2c(0) & n = 0 \\ c(n) & n > 0 \end{cases} \tag{4.6.16}$$

and

$$c(n) = \begin{cases} 0 & n < 0 \\ \dfrac{c_r(n)}{2} & n = 0 \\ c_r(n) & n > 0 \end{cases} \tag{4.6.17}$$

In other words, the cepstrum $c(n)$ can be obtained simply by taking the inverse Fourier
transform of $\log R(e^{j\omega})$ to obtain $c_r(n)$ and then applying (4.6.17).


EXAMPLE 4.6.1. From (4.6.7) we find that the cepstrum of the AP(1) model is given by

$$c(n) = \begin{cases} 0 & n < 0 \\ \log d_0 & n = 0 \\ \dfrac{1}{n}(-a)^n & n > 0 \end{cases} \tag{4.6.18}$$

From (4.2.18) with $P = 1$ and $k = 1$, we have $a_1^{(1)} = -r(1)/r(0) = k_1$; and from (4.6.7) we
have $a_1 = -c(1)$. These results are summarized below:

$$a_1^{(1)} = a = -\rho(1) = k_1 = -c(1) \tag{4.6.19}$$

The fact that $\rho(1) = c(1)$ here is peculiar to a single-pole spectrum and is not true in general
for arbitrary spectra. And $\rho(1)$ is the integral of a cosine-weighted spectrum while $c(1)$ is the
integral of a cosine-weighted log spectrum.


EXAMPLE 4.6.2. From (4.6.7), the cepstrum for an AP(2) model is equal to

$$c(n) = \begin{cases} 0 & n < 0 \\ \log d_0 & n = 0 \\ \dfrac{1}{n}(p_1^n + p_2^n) & n > 0 \end{cases} \tag{4.6.20}$$

For a complex conjugate pole pair, we have

$$c(n) = \frac{2}{n}\, r^n \cos n\theta \qquad n > 0 \tag{4.6.21}$$

where $p_{1,2} = r \exp(\pm j\theta)$. Therefore, the cepstrum of a damped sine wave is a damped cosine
wave. The cepstrum and autocorrelation are similar in that they are both damped cosines, but
the cepstrum has an additional $1/n$ weighting. From (4.6.7) and (4.6.8) we can relate the model
parameters and the cepstral coefficients:

$$a_1 = -c(1) \qquad a_2 = -c(2) + \tfrac{1}{2} c^2(1) \tag{4.6.22}$$

and

$$c(1) = -a_1 \qquad c(2) = -a_2 + \tfrac{1}{2} a_1^2 \tag{4.6.23}$$

Using (4.2.71) and the relations for the cepstrum, we can derive the conditions on the cepstrum
for $H(z)$ to be minimum-phase:

$$\begin{aligned} c(2) &> \frac{c^2(1)}{2} - 1 \\ c(2) &< \frac{c^2(1)}{2} - c(1) + 1 \\ c(2) &< \frac{c^2(1)}{2} + c(1) + 1 \end{aligned} \tag{4.6.24}$$

The corresponding admissible region is shown in Figure 4.14. The region corresponding to
complex roots is given by

$$\frac{c^2(1)}{2} - 1 < c(2) < \frac{c^2(1)}{4} \tag{4.6.25}$$

FIGURE 4.14
Minimum-phase region of the AP(2) model in the cepstral domain.


In comparing Figures 4.6, 4.8, and 4.14, we note that the admissible regions for the 
PACS and ACS are convex while that for the cepstral coefficients is not. (A region is convex 
if a straight line drawn between any two points in the region lies completely in the region.) 
In general, the PACS and the ACS span regions or spaces that are convex. The admissible 
region in Figure 4.14 for the model coefficients is also convex. However, for P > 2 the 
admissible regions for the model coefficients are not convex, in general. 


Cepstral distance. A measure of the difference between two signals, which has many
applications in speech coding and recognition, is the distance between their log spectra
(Rabiner and Juang 1993). It is known as the cepstral distance and is defined as

$$\mathrm{CD} \triangleq \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| \log R_1(e^{j\omega}) - \log R_2(e^{j\omega}) \right|^2 d\omega \tag{4.6.26}$$

$$= \sum_{n=-\infty}^{\infty} |c_1(n) - c_2(n)|^2 \tag{4.6.27}$$

where $c_1(n)$ and $c_2(n)$ are the cepstral coefficients of $R_1(e^{j\omega})$ and $R_2(e^{j\omega})$, respectively
(see Problem 4.36). Since for minimum-phase sequences the cepstrum decays fast, the
summation (4.6.27) can be computed with sufficient accuracy using a small number of
terms, usually 20 to 30. For minimum-phase all-pole models, which are mostly used in
speech processing, the cepstral coefficients are efficiently computed using the recursion
(4.6.7).
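The equality of (4.6.26) and (4.6.27) can be checked numerically for two all-pole models. The sketch below rests on several assumptions: both models have unit gain (so the $n = 0$ term vanishes), the cepstra of the spectra follow from (4.6.16) as $c_r(\pm n) = c(n)$, the integral is approximated by a uniform-grid sum, and the coefficient values and function names are hypothetical. Because one pole pair sits at radius 0.9, many more than 20 to 30 terms are used so that the comparison is tight.

```python
import cmath
import math


def ap_cepstrum(a, n_max):
    """c(0..n_max) of 1 / A(z) via (4.6.7); a = [a_1, ..., a_P], unit gain."""
    P = len(a)
    c = [0.0]
    for n in range(1, n_max + 1):
        acc = sum((n - k) * a[k - 1] * c[n - k] for k in range(1, min(n - 1, P) + 1))
        c.append(-(a[n - 1] if n <= P else 0.0) - acc / n)
    return c


def cd_from_cepstra(a_1, a_2, n_terms=200):
    """Truncated sum (4.6.27); the factor 2 accounts for lags n and -n."""
    c1, c2 = ap_cepstrum(a_1, n_terms), ap_cepstrum(a_2, n_terms)
    return 2.0 * sum((x - y) ** 2 for x, y in zip(c1[1:], c2[1:]))


def log_spectrum(a, w):
    """log R(e^{jw}) = -log |A(e^{jw})|^2 for a unit-gain all-pole model."""
    A = 1.0 + sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a, start=1))
    return -math.log(abs(A) ** 2)


def cd_from_spectra(a_1, a_2, n_grid=4096):
    """Uniform-grid approximation of the integral (4.6.26)."""
    total = 0.0
    for i in range(n_grid):
        w = -math.pi + 2.0 * math.pi * i / n_grid
        total += (log_spectrum(a_1, w) - log_spectrum(a_2, w)) ** 2
    return total / n_grid


a_1 = [-0.9, 0.81]   # hypothetical AP(2): poles 0.9 * exp(+-j*pi/3)
a_2 = [0.5]          # hypothetical AP(1): pole at -0.5

cd_cep = cd_from_cepstra(a_1, a_2)
cd_spec = cd_from_spectra(a_1, a_2)
print(cd_cep, cd_spec)
```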


4.6.3 All-Zero Models 


The cepstrum of a minimum-phase all-zero model is given by (4.6.2) and (4.6.3) with
$P = 0$. The cepstrum corresponding to a minimum-phase AZ(Q) model is related to its
real cepstrum by

$$c(n) = \begin{cases} 0 & n < 0 \\ \dfrac{c_r(n)}{2} & n = 0 \\ c_r(n) & n > 0 \end{cases} \tag{4.6.28}$$

Since we found $c(n)$, the coefficients of a minimum-phase AZ(Q) model $D(z)$ can be
evaluated recursively from

$$d_k = \begin{cases} e^{c(0)} & k = 0 \\ c(k)\, d_0 + \dfrac{1}{k} \sum_{m=0}^{k-1} m\, c(m)\, d_{k-m} & 1 \le k \le Q \end{cases} \tag{4.6.29}$$

This procedure for finding a minimum-phase polynomial $D(z)$ from the autocorrelation
consists in first computing the cepstrum from the log spectrum, then applying (4.6.28)
and the recursion (4.6.29) to compute the coefficients $d_k$. This approach to the spectral
factorization of AZ(Q) models is preferable because finding the roots of $R(z)$ for large Q
may be cumbersome.
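The factorization procedure just described can be sketched as follows. This is an illustration under assumptions, not a production routine: the real cepstrum is approximated by a uniform-grid sum in place of an inverse FFT, the spectrum must be strictly positive, and the test spectrum $1.25 + \cos\omega$ (the spectrum of the hypothetical model $D(z) = 1 + 0.5 z^{-1}$) and all function names are chosen for this example.

```python
import math


def real_cepstrum(spectrum, n_max, n_grid=1024):
    """c_r(n), n = 0..n_max, as the cosine-series coefficients of log R(e^{jw})."""
    logs = [math.log(spectrum(2.0 * math.pi * i / n_grid)) for i in range(n_grid)]
    return [
        sum(v * math.cos(2.0 * math.pi * i * n / n_grid) for i, v in enumerate(logs)) / n_grid
        for n in range(n_max + 1)
    ]


def min_phase_az(spectrum, Q):
    """Minimum-phase d_0..d_Q from a positive spectrum via (4.6.28) and (4.6.29)."""
    cr = real_cepstrum(spectrum, Q)
    c = [cr[0] / 2.0] + cr[1:]                   # (4.6.28)
    d = [math.exp(c[0])]                         # (4.6.29), k = 0
    for k in range(1, Q + 1):
        acc = sum(m * c[m] * d[k - m] for m in range(1, k))  # the m = 0 term is zero
        d.append(c[k] * d[0] + acc / k)
    return d


def example_spectrum(w):
    """R(e^{jw}) for r(0) = 1.25 and r(+-1) = 0.5."""
    return 1.25 + math.cos(w)


d = min_phase_az(example_spectrum, Q=2)
print(d)
```

Asking for one coefficient more than the true order, as done here with `Q=2`, is a simple sanity check: the extra coefficient should come out close to zero.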


Mixed pole-zero model representations. In the previous sections we saw that the
P + Q + 1 parameters of the minimum-phase PZ(P, Q) model can be represented equivalently
and uniquely by P + Q + 1 values of the impulse response, the autocorrelation, or the
cepstrum. A question arises as to whether PZ(P, Q) can be represented uniquely by a
mixture of representations, as long as the total number of representative values is P + Q + 1.
For example, could we have a unique representation that consists of, say, Q autocorrelation
values and P + 1 impulse response values, or some other mixture? The answer to this
question has not been explored in general; the relevant equations are sufficiently nonlinear
that a totally different approach would appear to be needed to solve the general problem.


4.7 SUMMARY 


In this chapter we introduced the class of pole-zero signal models and discussed their 
properties. Each model consists of two components: an excitation source and a system. 
In our treatment, we emphasized that the properties of a signal model are shaped by the 
properties of both components; and we tried, whenever possible, to attribute each property to 
its originator. Thus, for uncorrelated random inputs, which by definition are the excitations 
for ARMA models, the second-order moments of the signal model and its minimum-phase 
characteristics are completely determined by the system. For excitations with line spectra, 
properties such as minimum phase are meaningful only when they are attributed to the 
underlying system. If the goal is to model a signal with a line PSD, the most appropriate 
approach is to use a harmonic process. 

We provided a detailed description of the autocorrelation, power spectral density,
partial correlation, and cepstral properties of all AZ, AP, and PZ models for the general
case and for first- and second-order models. An understanding of these properties is very 
important for model selection in practical applications. 


PROBLEMS 


4.1 Show that a second-order pole $p_i$ contributes the term $n p_i^n u(n)$ and a third-order pole the terms
$n p_i^n u(n) + n^2 p_i^n u(n)$ to the impulse response of a causal PZ model. The general case is discussed
in Oppenheim et al. (1997).

4.2 Consider a zero-mean random sequence $x(n)$ with PSD

$$R_x(e^{j\omega}) = \frac{5 + 3\cos\omega}{17 + 8\cos\omega}$$

(a) Determine the innovations representation of the process $x(n)$.
(b) Find the autocorrelation sequence $r_x(l)$.
4.3 We want to generate samples of a Gaussian process with autocorrelation
$r_x(l) = (\tfrac{1}{2})^{|l|} + (-\tfrac{1}{2})^{|l|}$ for all $l$.

(a) Find the difference equation that generates the process $x(n)$ when excited by $w(n) \sim \mathrm{WGN}(0, 1)$.
(b) Generate N = 1000 samples of the process and estimate the pdf, using the histogram, and
the normalized autocorrelation $\rho_x(l)$, using $\hat{\rho}_x(l)$ [see Equation (1.2.1)].
(c) Check the validity of the model by plotting on the same graph (i) the true and estimated pdf
of $x(n)$ and (ii) the true and estimated autocorrelation.

4.4 Compute and compare the autocorrelations of the following processes:

(a) $x_1(n) = w(n) + 0.3w(n-1) - 0.4w(n-2)$ and
(b) $x_2(n) = w(n) - 1.2w(n-1) - 1.6w(n-2)$, where $w(n) \sim \mathrm{WGN}(0, 1)$.

Explain your findings.

4.5 Compute and plot the impulse response and the magnitude response of the systems $H(z)$ and
$H_N(z)$ in Example 4.2.1 for a = 0.7, 0.95 and N = 8, 16, 64. Investigate how well the all-zero
systems approximate the single-pole system.

4.6 Prove Equation (4.2.35) by writing explicitly Equation (4.2.33) and rearranging terms. Then
show that the coefficient matrix A can be written as the sum of a triangular Toeplitz matrix and
a triangular Hankel matrix (recall that a matrix H is Hankel if the matrix $JHJ^H$ is Toeplitz).

4.7 Use the Yule-Walker equations to determine the autocorrelation and partial autocorrelation
coefficients of the following AR models, assuming that $w(n) \sim \mathrm{WN}(0, 1)$.

(a) $x(n) = 0.5x(n-1) + w(n)$.
(b) $x(n) = 1.5x(n-1) - 0.6x(n-2) + w(n)$.

What is the variance $\sigma_x^2$ of the resulting process?

4.8 Given the AR process $x(n) = x(n-1) - 0.5x(n-2) + w(n)$, complete the following tasks.

(a) Determine $\rho_x(1)$.
(b) Using $\rho_x(0)$ and $\rho_x(1)$, compute $\{\rho_x(l)\}$ by the corresponding difference equation.
(c) Plot $\rho_x(l)$ and use the resulting graph to estimate its period.
(d) Compare the period obtained in part (c) with the value obtained using the PSD of the model.
(Hint: Use the frequency of the PSD peak.)

4.9 Given the parameters $d_0$, $a_1$, $a_2$, and $a_3$ of an AP(3) model, compute its ACS analytically and
verify your results, using the values in Example 4.2.3. (Hint: Use Cramer's rule.)

4.10 Consider the following AP(3) model: $x(n) = 0.98x(n-3) + w(n)$, where $w(n) \sim \mathrm{WGN}(0, 1)$.

(a) Plot the PSD of $x(n)$ and check if the obtained process is going to exhibit a pseudoperiodic
behavior.
(b) Generate and plot 100 samples of the process. Does the graph support the conclusion of
part (a)? If yes, what is the period?
(c) Compute and plot the PSD of the process $y(n) = \tfrac{1}{3}[x(n-1) + x(n) + x(n+1)]$.
(d) Repeat part (b) and explain the difference between the behavior of processes $x(n)$ and $y(n)$.

4.11 Consider the following AR(2) models: (i) $x(n) = 0.6x(n-1) + 0.3x(n-2) + w(n)$ and (ii)
$x(n) = 0.8x(n-1) - 0.5x(n-2) + w(n)$, where $w(n) \sim \mathrm{WGN}(0, 1)$.

(a) Find the general expression for the normalized autocorrelation sequence $\rho(l)$, and determine
$\sigma_x^2$.
(b) Plot $\{\rho(l)\}$ and check if the models exhibit pseudoperiodic behavior.
(c) Justify your answer in part (b) by plotting the PSD of the two models.

4.12 (a) Derive the formulas that express the PACS of an AP(3) model in terms of its ACS, using
the Yule-Walker equations and Cramer's rule.
(b) Use the obtained formulas to compute the PACS of the AP(3) model in Example 4.2.3.
(c) Check the results in part (b) by recomputing the PACS, using the algorithm of Levinson-Durbin.


4.13 Show that the spectrum of any PZ model with real coefficients has zero slope at $\omega = 0$ and
$\omega = \pi$.

4.14 Derive Equations (4.2.71) describing the minimum-phase region of the AP(2) model, starting
from the conditions

(a) $|p_1| < 1$, $|p_2| < 1$ and
(b) $|k_1| < 1$, $|k_2| < 1$.

4.15 (a) Show that the spectrum of an AP(2) model with real poles can be obtained by the cascade
connection of two AP(1) models with real coefficients.
(b) Compute and plot the impulse response, ACS, PACS, and spectrum of the AP models with
$p_1 = 0.6$, $p_2 = -0.9$, and $p_1 = p_2 = 0.9$.

4.16 Prove Equation (4.2.89) and demonstrate its validity by plotting the spectrum (4.2.88) for various
values of $r$ and $\theta$.

4.17 Prove that if the AP(P) model $A(z)$ is minimum-phase, then

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} \log \frac{1}{|A(e^{j\omega})|^2}\, d\omega = 0$$

4.18 (a) Prove Equations (4.2.101) and (4.2.102) and recreate the plot in Figure 4.8(a).
(b) Determine and plot the regions corresponding to complex and real poles in the autocorrelation
domain by recreating Figure 4.8(b).

4.19 Consider an AR(2) process $x(n)$ with $d_0 = 1$, $a_1 = -1.6454$, $a_2 = 0.9025$, and
$w(n) \sim \mathrm{WGN}(0, 1)$.

(a) Generate 100 samples of the process and use them to estimate the ACS $\hat{\rho}_x(l)$, using Equation
(1.2.1).
(b) Plot and compare the estimated and theoretical ACS values for $0 \le l \le 10$.
(c) Use the estimated values of $\hat{\rho}_x(l)$ and the Yule-Walker equations to estimate the parameters
of the model. Compare the estimated with the true values, and comment on the accuracy of
the approach.
(d) Use the estimated parameters to compute the PSD of the process. Plot and compare the
estimated and true PSDs of the process.
(e) Compute and compare the estimated with the true PACS.

4.20 Find a minimum-phase model with autocorrelation $\rho(0) = 1$, $\rho(\pm 1) = 0.25$, and $\rho(l) = 0$ for
$|l| \ge 2$.

4.21 Consider the MA(2) model $x(n) = w(n) - 0.1w(n-1) + 0.2w(n-2)$.

(a) Is the process $x(n)$ stationary? Why?
(b) Is the model minimum-phase? Why?
(c) Determine the autocorrelation and partial autocorrelation of the process.

4.22 Consider the following ARMA models: (i) $x(n) = 0.6x(n-1) + w(n) - 0.9w(n-1)$ and
(ii) $x(n) = 1.4x(n-1) - 0.6x(n-2) + w(n) - 0.8w(n-1)$.

(a) Find a general expression for the autocorrelation $\rho(l)$.
(b) Compute the partial autocorrelation $k_m$ for $m = 1, 2, 3$.
(c) Generate 100 samples from each process, and use them to estimate $\{\hat{\rho}(l)\}$ using Equation
(1.2.1).
(d) Use $\hat{\rho}(l)$ to estimate $\{\hat{k}_m\}$.
(e) Plot and compare the estimates with the theoretically obtained values.
4.23 Determine the coefficients of a PZ(2, 1) model with autocorrelation values $r_h(0) = 19$, $r_h(1) = 9$,
$r_h(2) = -5$, and $r_h(3) = -7$.

4.24 (a) Show that the impulse response of an AZ(Q) model can be recovered from its response
$\tilde{h}(n)$ to a periodic train with period L if L > Q.
(b) Show that the ACS of an AZ(Q) model can be recovered from the ACS or spectrum of $\tilde{h}(n)$
if L ≥ 2Q + 1.

4.25 Prove Equation (4.3.17) and illustrate its validity by computing the PACS of the model
$H(z) = 1 - 0.8z^{-1}$.

4.26 Prove Equations (4.3.24) that describe the minimum-phase region of the AZ(2) model.

4.27 Consider an AZ(2) model with $d_0 = 2$ and zeros $z_{1,2} = 0.95e^{\pm j\pi/3}$.

(a) Compute and plot N = 100 output samples by exciting the model with the process
$w(n) \sim \mathrm{WGN}(0, 1)$.
(b) Compute and plot the ACS, PACS, and spectrum of the model.
(c) Repeat parts (a) and (b) by assuming that we have an AP(2) model with poles at
$p_{1,2} = 0.95e^{\pm j\pi/3}$.
(d) Investigate the duality between the ACS and PACS of the two models.

4.28 Prove Equations (4.4.31) and use them to reproduce the plot shown in Figure 4.12(b). Indicate
which equation corresponds to each curve.

4.29 Determine the spectral flatness measure of the following processes:

(a) $x(n) = a_1 x(n-1) + a_2 x(n-2) + w(n)$ and
(b) $x(n) = w(n) + b_1 w(n-1) + b_2 w(n-2)$, where $w(n)$ is a white noise sequence.

4.30 Consider a zero-mean wide-sense stationary (WSS) process $x(n)$ with PSD $R_x(e^{j\omega})$ and an
M × M correlation matrix with eigenvalues $\{\lambda_i\}_1^M$. Szegő's theorem (Grenander and Szegő
1958) states that if $g(\cdot)$ is a continuous function, then

$$\lim_{M \to \infty} \frac{g(\lambda_1) + g(\lambda_2) + \cdots + g(\lambda_M)}{M} = \frac{1}{2\pi} \int_{-\pi}^{\pi} g[R_x(e^{j\omega})]\, d\omega$$

Using this theorem, show that

$$\lim_{M \to \infty} (\det \mathbf{R}_x)^{1/M} = \exp\left\{ \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln[R_x(e^{j\omega})]\, d\omega \right\}$$

4.31 Consider two linear random processes with system functions


-1 =2 =] 
On Oe — = a aid Gn = “= 
(a) Find a difference equation that leads to a numerically stable simulation of each process. 
(b) Generate and plot 100 samples from each process, and look for indications of nonstationarity 
in the obtained records. 
(c) Compute and plot the second difference of (i) and the first difference of (ii). Comment about 
the stationarity of the obtained records. 


Generate and plot 100 samples for each of the linear processes with system functions 
1 
(l—z7!)(1-0.9z71) 
1—0.5z~! 
(l—z7!) (1 -0.9z7!) 
and then estimate and examine the values of the ACS {o(/ We and the PACS fine’: 


(a) H(@Z)= 


(b) H(z) = 


4.33 Consider the process $y(n) = d_0 + d_1 n + d_2 n^2 + x(n)$, where $x(n)$ is a stationary process with
known autocorrelation $r_x(l)$.
(a) Show that the process $y^{(2)}(n)$ obtained by passing $y(n)$ through the filter $H(z) = (1 - z^{-1})^2$
is stationary.
(b) Express the autocorrelation $r_y^{(2)}(l)$ of $y^{(2)}(n)$ in terms of $r_x(l)$. Note: This process is used
in practice to remove quadratic trends from data before further analysis.


4.34 Prove Equation (4.6.7), which computes the cepstrum of an AP model from its coefficients. 


4.35 Consider a minimum-phase AZ(Q) model $D(z) = \sum_{k=0}^{Q} d_k z^{-k}$ with complex cepstrum $c(k)$.
We create another AZ model with coefficients $\tilde{d}_k = \alpha^k d_k$ and complex cepstrum $\tilde{c}(k)$.
(a) If $0 < \alpha < 1$, find the relation between $\tilde{c}(k)$ and $c(k)$.
(b) Choose $\alpha$ so that the new model has no minimum phase.
(c) Choose $\alpha$ so that the new model has a maximum phase.


4.36 Prove Equation (4.6.27), which determines the cepstral distance in the frequency and time 
domains. 
Nonparametric Power Spectrum Estimation


The essence of frequency analysis is the representation of a signal as a superposition of 
sinusoidal components. In theory, the exact form of this decomposition (spectrum) depends 
on the assumed signal model. In Chapters 2 and 3 we discussed the mathematical tools 
required to define and compute the spectrum of signals described by deterministic and 
stochastic models, respectively. In practical applications, where only a finite segment of a 
signal is available, we cannot obtain a complete description of the adopted signal model. 
Therefore, we can only compute an approximation (estimate) of the spectrum of the adopted 
signal model (“true” or theoretical spectrum). The quality of the estimated spectrum depends 
on 


• How well the assumed signal model represents the data.
• What values we assign to the unavailable signal samples.
• Which spectrum estimation method we use.


Clearly, meaningful application of spectrum estimation in practical problems requires 
sufficient a priori information, understanding of the signal generation process, knowledge 
of theoretical concepts, and experience. 

In this chapter we discuss the most widely used correlation and spectrum estimation 
methods, as well as their properties, implementation, and application to practical problems. 
We discuss only nonparametric techniques that do not assume a particular functional form, 
but allow the form of the estimator to be determined entirely by the data. These methods are 
based on the discrete Fourier transform of either the signal segment or its autocorrelation 
sequence. In contrast, parametric methods assume that the available signal segment has 
been generated by a specific parametric model (e.g., a pole-zero or harmonic model). Since 
the choice of an inappropriate signal model will lead to erroneous results, the successful 
application of parametric techniques, without sufficient a priori information, is very difficult 
in practice. These methods are discussed in Chapter 9. 

We begin this chapter with an introductory discussion on the purpose of, and the DSP 
approach to, spectrum estimation. We explore various errors involved in the estimation of 
finite-length data records (i.e., based on partial information). We also outline conventional 
techniques for deterministic signals, using concepts developed in Chapter 2. Also in Section 
3.6, we presented important concepts and results from the estimation theory that are used 
extensively in this chapter. Section 5.3 is the main section of this chapter in which we 
discuss various nonparametric approaches to the power spectrum estimation of stationary 
random signals. This analysis is extended to jointly stationary (bivariate) random signals
for the computation of the cross-spectrum in Section 5.4. The computation of auto and 
cross-spectra using Thomson’s multiple windows (or multitapers) is discussed in Section 
5.5. Finally, in Section 5.6 we summarize important topics and concepts from this chapter.
A classification of the various spectral estimation methods that are discussed in this book
is provided in Figure 5.1.


[Figure 5.1 is a tree diagram. Its recoverable node labels are: spectral estimation; deterministic
signal model: Fourier analysis (Section 5.1), main limitation: windowing; mainlobe width:
smoothing, loss of resolution; sidelobe height: leakage, "wrong" location of peaks; stochastic
signal models, main limitations: windowing + randomness; nonparametric methods; Fourier analysis
(Section 5.3): autocorrelation windowing, periodogram averaging; multitaper method (Section 5.5);
bias + randomness; Capon's minimum variance (Chapter 9); parametric methods; ARMA (pole-zero)
models (Chapter 9); harmonic process (Chapter 9); long-memory.]


FIGURE 5.1
Classification of various spectrum estimation methods.


5.1 SPECTRAL ANALYSIS OF DETERMINISTIC SIGNALS 


If we adopt a deterministic signal model, the mathematical tools for spectral analysis are the 
Fourier series and the Fourier transforms summarized in Section 2.2.1. It should be stressed 
at this point that applying any of these tools requires that the signal values in the entire 
time interval from $-\infty$ to $+\infty$ be available. If it is known a priori that a signal is periodic,
then only one period is needed. The rationale for defining and studying various spectra for
deterministic signals is threefold. First, we note that every realization (or sample function)
of a stochastic process is a deterministic function. Thus we can use the Fourier series and
transforms to compute a spectrum for stationary processes. Second, deterministic functions
and sequences are used in many aspects of the study of stationary processes, for example,
the autocorrelation sequence, which is a deterministic sequence. Third, the various spectra 
that can be defined for deterministic signals can be used to summarize important features 
of stationary processes. 

Most practical applications of spectrum estimation involve continuous-time signals. 
For example, in speech analysis we use spectrum estimation to determine the pitch of 
the glottal excitation and the formants of the vocal tract (Rabiner and Schafer 1978). In 
electroencephalography, we use spectrum estimation to study sleep disorders and the effect 
of medication on the functioning of the brain (Duffy, Iyer, and Surwillo 1989). Another 
application is in Doppler radar, where the frequency shift between the transmitted and the 
received waveform is used to determine the radial velocity of the target (Levanon 1988). 

The numerical computation of the spectrum of a continuous-time signal involves three 
steps: 


1. Sampling the continuous-time signal to obtain a sequence of samples. 

2. Collecting a finite number of contiguous samples (data segment or block) to use for the 
computation of the spectrum. This operation, which usually includes weighting of the 
signal samples, is known as windowing, or tapering. 

3. Computing the values of the spectrum at the desired set of frequencies. This step is 
usually implemented using some efficient implementation of the DFT. 


The above processing steps, which are necessary for DFT-based spectrum estimation, 
are shown in Figure 5.2. The continuous-time signal is first processed through a low-pass 
(antialiasing) filter and then sampled to obtain a discrete-time signal. Data samples of frame 
length $N$ with frame overlap $N_0$ are selected and then conditioned using a window. Finally,
a suitable-length DFT of the windowed data is taken as an estimate of its spectrum, which 
is then analyzed. In this section, we discuss in detail the effects of each of these operations 
on the accuracy of the computed spectrum. The understanding of the implications of these 
effects is very important in all practical applications of spectrum estimation. 
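
The processing chain just described can be sketched in a few lines. The following is a minimal illustration, not the system of Figure 5.2 itself: it omits the antialiasing filter, uses only the Python standard library, and every name in it (`frame_blocks`, `hamming`, `dft`, the 10-Hz test tone, the frame length of 50, the overlap of 25, and the 64-point DFT) is an assumption chosen for the example.

```python
import cmath
import math


def frame_blocks(x, frame_len, overlap):
    """Split x into frames of frame_len samples; neighbouring frames share `overlap` samples."""
    hop = frame_len - overlap
    if hop <= 0:
        raise ValueError("overlap must be smaller than frame_len")
    return [x[start:start + frame_len]
            for start in range(0, len(x) - frame_len + 1, hop)]


def hamming(length):
    """Hamming data window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def dft(x, n_fft):
    """n_fft-point DFT of x by direct summation (x is implicitly zero padded)."""
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(len(x)))
            for k in range(n_fft)]


fs = 100.0                                   # assumed sampling frequency in Hz
# Step 1: samples of a 10-Hz cosine (the antialiasing filter is not modelled).
x = [math.cos(2 * math.pi * 10.0 * n / fs) for n in range(200)]
# Step 2: frame blocking followed by windowing.
frames = frame_blocks(x, frame_len=50, overlap=25)
window = hamming(50)
# Step 3: a 64-point DFT of every windowed frame.
spectra = [dft([s * w for s, w in zip(frame, window)], n_fft=64)
           for frame in frames]
# Strongest bin in the lower half of the spectrum, and the frequency it represents.
peak_bins = [max(range(33), key=lambda k: abs(X[k])) for X in spectra]
peak_freqs_hz = [k * fs / 64 for k in peak_bins]
```

With these assumed values the tone lies between two DFT bins, so the reported peak frequency is only the nearest grid point; this is the spectrum-sampling effect examined in Section 5.1.3.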


[Figure 5.2 is a block diagram. Its recoverable block labels are: low-pass filter $H_{lp}(F)$ and
frame blocking.]


FIGURE 5.2
DFT-based Fourier analysis system for continuous-time signals.


5.1.1 Effect of Signal Sampling 


The continuous-time signal $s_c(t)$, whose spectrum we seek to estimate, is first passed through
a low-pass filter, also known as an antialiasing filter $H_{lp}(F)$, in order to minimize the aliasing
error after sampling. The antialiased signal $x_c(t)$ is then sampled through an analog-to-digital
converter† (ADC) to produce the discrete-time sequence $x(n)$, that is,

$$x(n) = x_c(t)\big|_{t=n/F_s} \qquad (5.1.1)$$

† We will ignore the quantization of discrete-time signals as discussed in Chapter 2.

From the sampling theorem in Section 2.2.2, we have

$$X(e^{j2\pi F/F_s}) = F_s \sum_{l=-\infty}^{\infty} X_c(F - lF_s) \qquad (5.1.2)$$

where $X_c(F) = H_{lp}(F)S_c(F)$. We note that the spectrum of the discrete-time signal $x(n)$ is
a periodic replication of $X_c(F)$. Overlapping of the replicas $X_c(F - lF_s)$ results in aliasing.
Since any practical antialiasing filter does not have infinite attenuation in the stopband,
some nonzero overlap of frequencies higher than $F_s/2$ should be expected within the band
of frequencies of interest in $x(n)$. These aliased frequencies give rise to the aliasing error,
which, in any practical signal, is unavoidable. It can be made negligible by a properly
designed antialiasing filter $H_{lp}(F)$.
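
A small numerical check may help make the replication in (5.1.2) concrete. The sketch below assumes an unfiltered cosine and a sampling frequency of 10 Hz (both illustrative choices, as is the helper name `sample_cosine`): cosines at 1 Hz, 9 Hz, and 11 Hz produce the same samples, so after sampling they cannot be told apart.

```python
import math


def sample_cosine(freq_hz, fs_hz, num_samples):
    """Samples x(n) = cos(2*pi*F*n/Fs) of a continuous-time cosine at frequency F."""
    return [math.cos(2 * math.pi * freq_hz * n / fs_hz) for n in range(num_samples)]


fs = 10.0                                # assumed sampling frequency in Hz
x_1hz = sample_cosine(1.0, fs, 20)       # below Fs/2, represented correctly
x_9hz = sample_cosine(9.0, fs, 20)       # above Fs/2, folds back to Fs - 9 = 1 Hz
x_11hz = sample_cosine(11.0, fs, 20)     # one replica away, 11 - Fs = 1 Hz

max_diff_9 = max(abs(a - b) for a, b in zip(x_1hz, x_9hz))
max_diff_11 = max(abs(a - b) for a, b in zip(x_1hz, x_11hz))
```

Both differences are zero up to rounding error, which is why components above $F_s/2$ must be attenuated before sampling rather than after.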


5.1.2 Windowing, Periodic Extension, and Extrapolation


In practice, we compute the spectrum of a signal by using a finite-duration segment, for one
or more of the following three reasons:


1. The spectral composition of the signal changes with time, or
2. We have only a finite set of data at our disposal, or
3. We wish to keep the computational complexity to an acceptable level.


Therefore, it is necessary to partition $x(n)$ into blocks (or frames) of data prior to processing.
This operation is called frame blocking, and it is characterized by two parameters: the length
of frame $N$ and the overlap between frames $N_0$ (see Figure 5.2). The central
problem in practical frequency analysis can thus be stated as follows:


Determine the spectrum of a signal $x(n)$, $-\infty < n < \infty$, from its values in a finite
interval $0 \le n \le N - 1$, that is, from a finite-duration segment.


Since $x(n)$ is unknown for $n < 0$ and $n \ge N$, we cannot say, without having sufficient
a priori information, whether the signal is periodic or aperiodic. If we can reasonably assume
that the signal is periodic with fundamental period $N$, we can easily determine its spectrum
by computing its Fourier series, using the DFT (see Section 2.2.1).

However, in most practical applications, we cannot make this assumption because the 
available block of data could be either part of the period of a periodic signal or a segment 
from an aperiodic signal. In such cases, the spectrum of the signal cannot be determined 
without assigning values to the signal samples outside the available interval. There are three 
ways to deal with this issue: 


1. Periodic extension. We assume that $x(n)$ is periodic with period $N$, that is, $x(n) =
x(n + N)$ for all $n$, and we compute its Fourier series, using the DFT.

2. Windowing. We assume that the signal is zero outside the interval of observation, that
is, $x(n) = 0$ for $n < 0$ and $n \ge N$. This is equivalent to multiplying the signal with the
rectangular window

$$w_R(n) \triangleq \begin{cases} 1 & 0 \le n \le N-1 \\ 0 & \text{elsewhere} \end{cases} \qquad (5.1.3)$$

The resulting sequence is aperiodic, and its spectrum is obtained by the discrete-time
Fourier transform (DTFT).

3. Extrapolation. We use a priori information about the signal to extrapolate (i.e., determine
its values for $n < 0$ and $n \ge N$) outside the available interval and then determine its
spectrum by using the DTFT.


Periodic extension and windowing can be considered the simplest forms of extrapolation.
It should be obvious that a successful extrapolation results in better spectrum estimates
than periodic extension or windowing. Periodic extension is a straightforward application
of the DFT, whereas extrapolation requires some form of a sophisticated signal model. As
we shall see, most of the signal modeling techniques discussed in this book result in some
kind of extrapolation. We first discuss, in the next section, the effect of spectrum sampling
as imposed by the application of DFT (and its side effect, the periodic extension) before
we provide a detailed analysis of the effect of windowing.


5.1.3 Effect of Spectrum Sampling 


In many real-time spectrum analyzers, as illustrated in Figure 5.2, the spectrum is com- 
puted (after signal conditioning) by using the DFT. From Section 2.2.3, we note that this 
computation samples the continuous spectrum at equispaced frequencies. Theoretically, if 
the number of DFT samples is greater than or equal to the frame length N, then the exact 
continuous spectrum (based on the given frame) can be obtained by using the frequency- 
domain reconstruction (Oppenheim and Schafer 1989; Proakis and Manolakis 1996). This 
reconstruction, which requires a periodic sinc function [defined in (5.1.9)], is not a practical 
function to implement, especially in real-time applications. Hence a simple linear interpola- 
tion is used for plotting or display purposes. This linear interpolation can lead to misleading 
results even though the computed DFT sample values are correct. It is possible that there 
may not be a DFT sample precisely at a frequency where a peak of the DTFT is located. 
In other words, the DFT spectrum misses this peak, and the resulting linearly interpolated 
spectrum provides the wrong location and height of the DTFT spectrum peak. This error 
can be made smaller by sampling the DTFT spectrum at a finer grid, that is, by increasing 
the size of the DFT. The denser spectrum sampling is implemented by an operation called 
zero padding and is discussed later in this section. 

Another effect of the application of DFT for spectrum calculations is the periodic 
extension of the sequence in the time domain. From our discussion in Section 2.2.3, it 
follows that the N-point DFT 


$$X(k) = \sum_{n=0}^{N-1} x(n)\,e^{-j(2\pi/N)kn} \qquad (5.1.4)$$


is periodic with period $N$. This should be expected given the relationship of the DFT to
the Fourier transform or the Fourier series of discrete-time signals, which are periodic in $\omega$
with period $2\pi$. A careful look at the inverse DFT


$$x(n) = \frac{1}{N}\sum_{k=0}^{N-1} X(k)\,e^{j(2\pi/N)kn} \qquad (5.1.5)$$


reveals that $x(n)$ is also periodic with period $N$. This is a somewhat surprising result since
no assumption about the signal $x(n)$ outside the interval $0 \le n \le N - 1$ has been made.
However, this periodicity in the time domain can be easily justified by recalling that sampling 
in the time domain results in a periodicity in the frequency domain, and vice versa. 
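
The implied periodic extension can be observed directly by evaluating the inverse DFT sum (5.1.5) at indices outside $0 \le n \le N - 1$. The following sketch assumes an arbitrary four-point test sequence; the helper names `dft` and `idft_at` are introduced here only for illustration.

```python
import cmath
import math


def dft(x):
    """N-point DFT of x, following (5.1.4)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]


def idft_at(X, n):
    """Inverse DFT sum (5.1.5) evaluated at any integer n, inside or outside 0..N-1."""
    N = len(X)
    return sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N


x = [1.0, 2.0, 0.5, -1.0]                  # arbitrary 4-point test sequence
X = dft(x)
indices = list(range(-4, 8))               # one period before, one after
extended = [idft_at(X, n).real for n in indices]
```

The list `extended` repeats the four original values three times: the formula returns $x(n \bmod N)$ even though nothing was assumed about the signal outside the observed block.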

To understand these effects of spectrum sampling, consider the following example in 
which a continuous-time sinusoidal signal is sampled and then is truncated by a rectangular 
window before its DFT is performed. 


EXAMPLE 5.1.1. A continuous-time signal $x_c(t) = 2\cos 2\pi t$ is sampled with a sampling
frequency of $F_s = 1/T = 10$ samples per second, to obtain the sequence $x(n)$. It is windowed
by an N-point rectangular window $w_R(n)$ to obtain the sequence $x_N(n)$. Determine and plot
$|X_N(k)|$, the magnitude of the DFT of $x_N(n)$, for (a) N = 10 and (b) N = 15. Comment on the
shapes of these plots.

Solution. The discrete-time signal $x(n)$ is a sampled version of $x_c(t)$ and is given by

$$x(n) = x_c(t = nT) = 2\cos\frac{2\pi n}{F_s} = 2\cos 0.2\pi n \qquad T = 0.1\ \text{s}$$

Then, $x(n)$ is a periodic sequence with fundamental period N = 10.


a. For N = 10, we obtain $x_N(n) = 2\cos 0.2\pi n$, $0 \le n \le 9$, which contains one period of
$x(n)$. The periodic extension of $x_N(n)$ and the magnitude plot of its DFT are shown in the
top row of Figure 5.3. For comparison, the DTFT $X_N(e^{j\omega})$ of $x_N(n)$ is also superimposed
on the DFT samples. We observe that the DFT has only two nonzero samples, which together
constitute the correct frequency of the analog signal $x_c(t)$. The DTFT has a mainlobe and
several sidelobes due to the windowing effect. However, the DFT samples the sidelobes at
their zero values, as illustrated in the DFT plot. Another explanation for this behavior is that
since the samples in $x_N(n)$ for N = 10 constitute one full period of $\cos 0.2\pi n$, the 10-point
periodic extension of $x_N(n)$, shown in the top left graph of Figure 5.3, results in the original
sinusoidal sequence $x(n)$. Thus what the DFT "sees" is the exact sampled signal $x_c(t)$. In
this case, the choice of N is a desirable one.


b. For N = 15, we obtain $x_N(n) = 2\cos 0.2\pi n$, $0 \le n \le 14$, which contains 1½ periods
of $x(n)$. The periodic extension of $x_N(n)$ and the magnitude plot of its DFT are shown in
the bottom row of Figure 5.3. Once again for comparison, the DTFT $X_N(e^{j\omega})$ of $x_N(n)$
is superimposed on the DFT samples. In this case, the DFT plot looks markedly different
from that for N = 10 although the DTFT plot appears to be similar. In this case, the DFT
does not sample two peaks at the exact frequencies; hence if the resulting DFT samples are
joined by the linear interpolation, then we will get a misleading result. Since the sequence
$x_N(n)$ does not contain full periods of $\cos 0.2\pi n$, the periodic extension of $x_N(n)$ contains
discontinuities at $n = lN$, $l = 0, \pm 1, \pm 2, \ldots$, as shown in the bottom left graph of Figure
5.3. This discontinuity results in higher-order harmonics in the DFT values. The DTFT plot
also has mainlobes and sidelobes, but the DFT samples these sidelobes at nonzero values.
Therefore, the length of the window is an important consideration in spectrum estimation.
The sidelobes are the source of the problem of leakage that gives rise to bias in the spectral
values, as we will see in the following section. The suppression of the sidelobes is controlled
by the window shape, which is another important consideration in spectrum estimation.


FIGURE 5.3
Effect of window length L on the DFT spectrum shape.
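
The two cases of this example are easy to reproduce numerically. The following sketch, which assumes nothing beyond the signal $x(n) = 2\cos 0.2\pi n$ of the example and uses a hypothetical helper `dft_magnitudes`, computes the DFT magnitudes for N = 10 and N = 15 by direct summation.

```python
import cmath
import math


def dft_magnitudes(N):
    """|X_N(k)| for the N-sample rectangular-windowed segment of x(n) = 2 cos(0.2*pi*n)."""
    x = [2 * math.cos(0.2 * math.pi * n) for n in range(N)]
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N)]


mags10 = dft_magnitudes(10)   # one full period: energy only in bins k = 1 and k = 9
mags15 = dft_magnitudes(15)   # one and a half periods: every bin is nonzero
```

For N = 10 the only nonzero samples are at k = 1 and k = 9, each of magnitude 10, whereas for N = 15 no DFT sample vanishes, in agreement with the leakage described in part (b).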


A quantitative description of the above interpretations and arguments related to the 
capacities and limitations of the DFT is offered by the following result (see Proakis and 
Manolakis 1996). 


THEOREM 5.1 (DFT SAMPLING THEOREM). Let $x_c(t)$, $-\infty < t < \infty$, be a continuous-time
signal with Fourier transform $X_c(F)$, $-\infty < F < \infty$. Then, the N-point sequences
$\{T x_p(n),\ 0 \le n \le N - 1\}$ and $\{X_p(k),\ 0 \le k \le N - 1\}$ form an N-point DFT pair, where


$$x_p(n) \triangleq \sum_{m=-\infty}^{\infty} x_c(nT - mNT) \qquad\text{and}\qquad X_p(k) \triangleq \sum_{l=-\infty}^{\infty} X_c\!\left(k\frac{F_s}{N} - lF_s\right) \qquad (5.1.6)$$


and $F_s = 1/T$ is the sampling frequency.
Proof. The proof is explored in Problem 5.1.


Thus, given a continuous-time signal $x_c(t)$ and its spectrum $X_c(F)$, we can create a DFT
pair by sampling and aliasing in the time and frequency domains. Obviously, this DFT pair
provides a "faithful" description of $x_c(t)$ and $X_c(F)$ if both the time-domain aliasing and the
frequency-domain aliasing are insignificant. The meaning of relation (5.1.6) is graphically
illustrated in Figure 5.4. In this figure, we show the time-domain signals in the left column
and their Fourier transforms in the right column. The top row contains continuous-time
signals, which are shown as nonperiodic and of infinite extent in both domains, since many
real-world signals exhibit this behavior. The middle row contains the sampled version of
the continuous-time signal and its periodic Fourier transform (the nonperiodic transform
is shown as a dashed curve). Clearly, aliasing in the frequency domain is evident. Finally,
the bottom row shows the sampled (periodic) Fourier transform and its corresponding
time-domain periodic sequence. Again, aliasing in the time domain should be expected. Thus
we have sampled and periodic signals in both domains, with aliasing certain in at least one
domain and possible in both. This figure should be recalled any time we use
the DFT for the analysis of sampled signals.
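
The time-domain half of this statement has a simple discrete-time analogue that can be checked numerically. The sketch below is an illustration under assumed values, not a proof of Theorem 5.1: a six-point sequence has its DTFT sampled at only N = 4 frequencies, and the inverse DFT of those samples returns the sequence aliased with period 4. The helper name `dtft` and the test sequence are assumptions.

```python
import cmath
import math


def dtft(x, omega):
    """DTFT of the finite sequence x (starting at n = 0) evaluated at frequency omega."""
    return sum(x[n] * cmath.exp(-1j * omega * n) for n in range(len(x)))


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # six samples, longer than the DFT size
N = 4                                   # too few frequency samples for this length

# Sample the spectrum at N equispaced frequencies.
X_samples = [dtft(x, 2 * math.pi * k / N) for k in range(N)]

# The inverse DFT of the samples gives the periodically overlapped sequence.
x_p = [(sum(X_samples[k] * cmath.exp(2j * math.pi * k * n / N)
            for k in range(N)) / N).real
       for n in range(N)]
```

Here `x_p` equals [6, 8, 3, 4], that is, $x(n) + x(n + 4)$: the last two samples have wrapped around onto the first two, which is time-domain aliasing caused by sampling the spectrum too coarsely.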


Zero padding 


The N-point DFT values of an N-point sequence $x(n)$ are samples of the DTFT $X(e^{j\omega})$,
as discussed in Chapter 2. These samples can be used to reconstruct the DTFT $X(e^{j\omega})$ by
using the periodic sinc interpolating function. Alternatively, one can obtain more (i.e., denser)
samples of the DTFT by computing a larger $N_{FFT}$-point DFT of $x(n)$, where $N_{FFT} \gg N$.
Since the number of samples of $x(n)$ is fixed, the only way we can treat $x(n)$ as an $N_{FFT}$-point
sequence is by appending $N_{FFT} - N$ zeros to it. This procedure is called the zero
padding operation, and it is used for many purposes including the augmentation of the
sequence length so that a power-of-2 FFT algorithm can be used. In spectrum estimation,
zero padding is primarily used to provide a better-looking plot of the spectrum of a
finite-length sequence. This is shown in Figure 5.5 where the magnitude of an $N_{FFT}$-point DFT of
the eight-point sequence $x(n) = \cos(2\pi n/4)$ is plotted for $N_{FFT}$ = 8, 16, 32, and 64. The
DTFT magnitude $|X(e^{j\omega})|$ is also shown for comparison. It can be seen that as more zeros
are appended (by increasing $N_{FFT}$), the resulting larger-point DFT provides more closely
spaced samples of the DTFT, thus giving a better-looking plot. Note, however, that the zero
padding does not increase the resolution of the spectrum; that is, there are no new peaks
and valleys in the display, just a better display of the available information. This type of plot
is called a high-density spectrum. For a high-resolution spectrum, we have to collect more
information by increasing N. The DTFT plots shown in Figures 5.3 and 5.5 were obtained
by using a very large amount of zero padding.
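
The claim that zero padding only samples the same DTFT more densely can be verified on the eight-point sequence used above. The following is a minimal sketch; the helper `dft_zero_padded` is an assumed name, and the DFT is computed by direct summation rather than by an FFT.

```python
import cmath
import math


def dft_zero_padded(x, n_fft):
    """n_fft-point DFT of x after appending n_fft - len(x) zeros."""
    padded = list(x) + [0.0] * (n_fft - len(x))
    return [sum(padded[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft))
            for k in range(n_fft)]


x = [math.cos(2 * math.pi * n / 4) for n in range(8)]   # eight-point sequence
X8 = dft_zero_padded(x, 8)      # no padding
X16 = dft_zero_padded(x, 16)    # eight appended zeros

# Every second sample of the 16-point DFT coincides with the 8-point DFT.
max_mismatch = max(abs(X16[2 * k] - X8[k]) for k in range(8))
```

The odd-indexed samples of `X16` are new points on the same underlying DTFT curve; they add display density but carry no information that the eight data samples did not already contain.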


5.1.4 Effects of Windowing: Leakage and Loss of Resolution 


To see the effect of the window on the spectrum of an arbitrary deterministic signal x(n), 
defined over the entire range —oo < n < oo, we notice that the available data record can 
be expressed as 


$$x_N(n) = x(n)\,w_R(n) \qquad (5.1.7)$$


FIGURE 5.4
Graphical illustration of the DFT sampling theorem.


FIGURE 5.5
Effect of zero padding.


where $w_R(n)$ is the rectangular window defined in (5.1.3). Thus, a finite segment of the
signal can be thought of as a product of the actual signal $x(n)$ and a data window $w(n)$. In
(5.1.7), $w(n) = w_R(n)$, but $w(n)$ can be any arbitrary finite-duration sequence. The Fourier
transform of $x_N(n)$ is


$$X_N(e^{j\omega}) = X(e^{j\omega}) \circledast W(e^{j\omega}) \triangleq \frac{1}{2\pi}\int_{-\pi}^{\pi} X(e^{j\theta})\,W(e^{j(\omega-\theta)})\,d\theta \qquad (5.1.8)$$

that is, $X_N(e^{j\omega})$ equals the periodic convolution of the actual Fourier transform with the
Fourier transform $W(e^{j\omega})$ of the data window. For the rectangular window, $W(e^{j\omega}) =
W_R(e^{j\omega})$, where

$$W_R(e^{j\omega}) = \frac{\sin(\omega N/2)}{\sin(\omega/2)}\,e^{-j\omega(N-1)/2} \triangleq A(\omega)\,e^{-j\omega(N-1)/2} \qquad (5.1.9)$$


The function $A(\omega)$ is a periodic function in $\omega$ with fundamental period equal to $2\pi$ and is
called a periodic sinc function. Figure 5.6 shows three periods of $A(\omega)$ for N = 11. We
note that $W_R(e^{j\omega})$ consists of a mainlobe (ML)


$$W_{ML}(e^{j\omega}) = \begin{cases} W_R(e^{j\omega}) & |\omega| < \dfrac{2\pi}{N} \\[4pt] 0 & \dfrac{2\pi}{N} \le |\omega| \le \pi \end{cases} \qquad (5.1.10)$$


and the sidelobes $W_{SL}(e^{j\omega}) = W_R(e^{j\omega}) - W_{ML}(e^{j\omega})$. Thus, (5.1.8) can be written as


$$X_N(e^{j\omega}) = X(e^{j\omega}) \circledast W_{ML}(e^{j\omega}) + X(e^{j\omega}) \circledast W_{SL}(e^{j\omega}) \qquad (5.1.11)$$


FIGURE 5.6
Plot of $A(\omega) = \sin(\omega N/2)/\sin(\omega/2)$ for N = 11.


The first convolution in (5.1.11) smooths rapid variations and suppresses narrow peaks
in $X(e^{j\omega})$, whereas the second convolution introduces ripples in smooth regions of $X(e^{j\omega})$
and can create "false" peaks. Therefore, the spectrum we observe is the convolution of the
actual spectrum with the Fourier transform of the data window. The only way to improve
the estimate is to increase the window length N or to choose another window shape. For
the rectangular window, increasing N results in a narrower mainlobe, and the distortion
is reduced. As $N \to \infty$, $W_R(e^{j\omega})$ tends to an impulse train with period $2\pi$ and $X_N(e^{j\omega})$
tends to $X(e^{j\omega})$, as expected. Since in practice the value of N is always finite, the only way
to improve the estimate $X_N(e^{j\omega})$ is by properly choosing the shape of the window $w(n)$.
The only restriction on $w(n)$ is that it be of finite duration.

It is known that any time-limited sequence $w(n)$ has a Fourier transform $W(e^{j\omega})$ that is
nonzero except at a finite number of frequencies. Thus, from (5.1.8) we see that the estimated
value $X_N(e^{j\omega_0})$ is computed by using all values of $X(e^{j\omega})$ weighted by $W(e^{j(\omega_0-\omega)})$. The
contribution of the sinusoidal components with frequencies $\omega \ne \omega_0$ to the value $X_N(e^{j\omega_0})$
introduces an error known as leakage. As the name suggests, energy from one frequency
range "leaks" into another, giving the wrong impression of stronger or weaker frequency
components.
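
The closed form (5.1.9) for the rectangular-window transform can be compared against the defining DTFT sum. The sketch below does so for N = 11; the function names, the test frequency of 0.3 rad, and the handling of the 0/0 points by their limit are choices made for this illustration.

```python
import cmath
import math


def periodic_sinc(omega, N):
    """A(omega) = sin(omega*N/2) / sin(omega/2), using the limit where sin(omega/2) = 0."""
    denom = math.sin(omega / 2)
    if abs(denom) < 1e-12:
        # L'Hopital limit of the ratio at multiples of 2*pi.
        return N * math.cos(omega * N / 2) / math.cos(omega / 2)
    return math.sin(omega * N / 2) / denom


def rect_window_dtft(omega, N):
    """DTFT of the N-point rectangular window by direct summation."""
    return sum(cmath.exp(-1j * omega * n) for n in range(N))


N = 11
omega = 0.3                                    # arbitrary test frequency in radians
closed_form = periodic_sinc(omega, N) * cmath.exp(-1j * omega * (N - 1) / 2)
direct = rect_window_dtft(omega, N)

peak = periodic_sinc(0.0, N)                   # mainlobe peak value
zero_crossings = [periodic_sinc(2 * math.pi * k / N, N) for k in range(1, N)]
```

The two evaluations agree, the peak equals N, and $A(\omega)$ vanishes at $\omega = 2\pi k/N$ for k = 1, ..., N − 1. These zeros are exactly the points at which an N-point DFT samples the sidelobes in part (a) of Example 5.1.1.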

To illustrate the effect of the window shape and duration on the estimated spectrum, 
consider the signal 


$$x(n) = \cos 0.35\pi n + \cos 0.4\pi n + 0.25\cos 0.8\pi n \qquad (5.1.12)$$


which has a line spectrum with lines at frequencies $\omega_1 = 0.35\pi$, $\omega_2 = 0.4\pi$, and $\omega_3 =
0.8\pi$. This line spectrum (normalized so that the magnitude is between 0 and 1) is shown
in the top graph of Figure 5.7 over $0 \le \omega \le \pi$. The spectrum $X_N(e^{j\omega})$ of $x_N(n)$ using the
rectangular window is given by


$$X_N(e^{j\omega}) = \tfrac{1}{2}\left[W(e^{j(\omega+\omega_1)}) + W(e^{j(\omega-\omega_1)}) + W(e^{j(\omega+\omega_2)}) + W(e^{j(\omega-\omega_2)}) + 0.25\,W(e^{j(\omega+\omega_3)}) + 0.25\,W(e^{j(\omega-\omega_3)})\right]$$

The second and the third plots in Figure 5.7 show 2048-point DFTs of $x_N(n)$ for a
rectangular data window with N = 21 and N = 81. We note that the ability to pick out peaks
(resolvability) depends on the duration N − 1 of the data window.† To resolve two spectral
lines at $\omega = \omega_1$ and $\omega = \omega_2$ using a rectangular window, we should have the difference
$|\omega_1 - \omega_2|$ greater than the mainlobe width $\Delta\omega$, which is approximately equal to $2\pi/(N - 1)$,
in radians per sampling interval, from the plot of $A(\omega)$ in Figure 5.6, that is,


$$|\omega_1 - \omega_2| > \Delta\omega \approx \frac{2\pi}{N-1} \qquad\text{or}\qquad N > \frac{2\pi}{|\omega_1 - \omega_2|} + 1 \qquad (5.1.13)$$


† Since there are N samples in a data window, the number of intervals or durations is N − 1.


FIGURE 5.7
Spectrum of three sinusoids using rectangular and Hamming 
windows. 


For a rectangular window of length N, the exact value of $\Delta\omega$ is equal to $1.81\pi/(N - 1)$.
If N is too small, the two peaks at $\omega = 0.35\pi$ and $\omega = 0.4\pi$ are fused into one, as shown
in the N = 21 plot. When N = 81, the corresponding plot shows a resolvable separation;
however, the peaks have shifted somewhat from their true locations. This is called bias, and
it is a direct result of the leakage from sidelobes. In both cases, the peak at $\omega = 0.8\pi$ can
be distinguished easily (but also has a bias).
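
Condition (5.1.13) can be applied to the signal (5.1.12) directly. The sketch below evaluates the bound and then compares the rectangular-window spectrum at the two line frequencies with its value midway between them; the function names and the choice of the midpoint as a probe are assumptions made for the illustration.

```python
import cmath
import math


def min_length_to_resolve(w1, w2):
    """Right-hand side of (5.1.13): N must exceed 2*pi/|w1 - w2| + 1."""
    return 2 * math.pi / abs(w1 - w2) + 1


def spectrum_magnitude(omega, N):
    """|X_N(e^{j omega})| for the signal (5.1.12) truncated by an N-point rectangular window."""
    total = 0j
    for n in range(N):
        x_n = (math.cos(0.35 * math.pi * n) + math.cos(0.4 * math.pi * n)
               + 0.25 * math.cos(0.8 * math.pi * n))
        total += x_n * cmath.exp(-1j * omega * n)
    return abs(total)


w1, w2 = 0.35 * math.pi, 0.4 * math.pi
n_min = min_length_to_resolve(w1, w2)      # bound from (5.1.13)

# Magnitude at the first line, midway between the lines, and at the second line.
report = {N: tuple(spectrum_magnitude(w, N) for w in (w1, (w1 + w2) / 2, w2))
          for N in (21, 81)}
```

The bound evaluates to 41, so N = 21 falls short of it and N = 81 exceeds it. For N = 81 the midpoint value is far below the values at the two lines, which corresponds to the separated peaks in Figure 5.7; the entries for N = 21 can be inspected in the same way.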

Another important observation is that the sidelobes of the data window introduce false
peaks. For a rectangular window, the peak sidelobe level is 13 dB below the mainlobe peak
(taken as 0 dB), which is not a good attenuation. Thus these false peaks have values that are
comparable to that of the true peak at $\omega = 0.8\pi$, as shown in Figure 5.7. These peaks can be
minimized by reducing the amplitudes of the sidelobes. The rectangular window cannot help
in this regard because of the well-known Gibbs phenomenon associated with it. We need a
different window shape. However, any window other than the rectangular window has a wider
mainlobe; hence this reduction can be achieved only at the expense of the resolution. To
illustrate this, consider the Hamming (Hm) data window, given by


$$w_{Hm}(n) = \begin{cases} 0.54 - 0.46\cos\dfrac{2\pi n}{N-1} & 0 \le n \le N-1 \\[4pt] 0 & \text{otherwise} \end{cases} \qquad (5.1.14)$$


with the approximate width of the mainlobe equal to $8\pi/(N - 1)$ and the exact mainlobe
width equal to $6.27\pi/(N - 1)$. The peak sidelobe level is 43 dB below the mainlobe peak, which is
considerably better than that of the rectangular window. The Hamming window is obtained
by using the hamming(N) function in MATLAB.

The bottom plot in Figure 5.7 shows the 2048-point DFT of the signal $x_N(n)$ for a
Hamming window with N = 81. Now the peak at $\omega = 0.8\pi$ is more prominent than
before, and the sidelobes are almost suppressed. Note also that since the mainlobe
of the Hamming window is wider, the peaks have a wider base, so much so that the first
two frequencies are barely recognized. We can correct this problem by choosing a larger
window length. This interplay between the shape and the duration of a window function is
one of the important issues and, as we will see in Section 5.3, produces similar effects in
the spectral analysis of random signals.
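
The sidelobe levels quoted for the two windows can be estimated numerically. The following sketch evaluates each window's DTFT magnitude on a dense grid and reports the largest value found beyond the first minimum, in decibels relative to the peak at $\omega = 0$. The window length of 41, the grid of 2048 intervals, and the function names are assumptions of this illustration, and the result is a grid-based estimate rather than an exact figure.

```python
import cmath
import math


def rectangular(N):
    return [1.0] * N


def hamming(N):
    """Hamming window according to (5.1.14)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def peak_sidelobe_db(window, grid=2048):
    """Estimate the peak sidelobe level in dB relative to the mainlobe peak."""
    mags = [abs(sum(w * cmath.exp(-1j * (math.pi * i / grid) * n)
                    for n, w in enumerate(window)))
            for i in range(grid + 1)]
    i = 1
    # Walk down the mainlobe until the magnitude starts to rise again.
    while i < grid and mags[i] <= mags[i - 1]:
        i += 1
    return 20 * math.log10(max(mags[i:]) / mags[0])


N = 41
rect_db = peak_sidelobe_db(rectangular(N))
hamming_db = peak_sidelobe_db(hamming(N))
```

The estimates should land near −13 dB for the rectangular window and in the low −40s for the Hamming window, consistent with the values stated above; small deviations from the quoted figures are expected for a finite N and a finite grid.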


Some useful windows 


The design of windows for spectral analysis applications has drawn a lot of attention and 
is examined in detail in Harris (1978). We have already discussed two windows, namely, 
the rectangular and the Hamming window. Another useful window in spectrum analysis 
is due to Hann and is mistakenly known as the Hanning window. There are several such 
windows with varying degrees of tradeoff between resolution (mainlobe width) and leakage 
(peak sidelobe level). These windows are known as fixed windows since each provides a 
fixed amount of leakage that is independent of the length N. Unlike fixed windows, there
are windows that contain a design parameter that can be used to trade between resolution 
and leakage. Two such windows are the Kaiser window and the Dolph-Chebyshev window, 
which are widely used in spectrum estimation. Figure 5.8 shows the time-domain window 
functions and their corresponding frequency-domain log-magnitude plots in decibels for 
these five windows. The important properties such as peak sidelobe level and mainlobe 
width of these windows are compared in Table 5.1. 


TABLE 5.1
Comparison of properties of commonly used windows. Each window is assumed to be
of length N.


Window            Peak sidelobe   Approximate       Exact
type              level (dB)      mainlobe width    mainlobe width

Rectangular       -13             4π/(N - 1)        1.81π/(N - 1)
Hanning           -32             8π/(N - 1)        5.01π/(N - 1)
Hamming           -43             8π/(N - 1)        6.27π/(N - 1)
Kaiser            -A              -                 (A - 8)/[2.285(N - 1)]
Dolph-Chebyshev   -A              -                 cos^-1{[cosh(cosh^-1(10^(A/20))/(N - 1))]^-1}


Hanning window. This window is given by the function


$$w_{Hn}(n) = \begin{cases} 0.5 - 0.5\cos\dfrac{2\pi n}{N-1} & 0 \le n \le N-1 \\[4pt] 0 & \text{otherwise} \end{cases} \qquad (5.1.15)$$


which is a raised cosine function. The peak sidelobe level is 32 dB below the mainlobe peak, and the
approximate mainlobe width is $8\pi/(N - 1)$ while the exact mainlobe width is $5.01\pi/(N - 1)$.
In MATLAB this window function is obtained through the function hanning(N).


Kaiser window. This window function is due to J. F. Kaiser and is given by 207 


SECTION 5.1 _ 
Io {By — [1 -—2n/(N - DP Spectral Analysis of 
wK(n) = (A) O<n<N-I1 (5.1.16) Deterministic Signals 


0 otherwise 


where $I_0(\cdot)$ is the modified zero-order Bessel function of the first kind and $\beta$ is a
window shape parameter that can be chosen to obtain various peak sidelobe levels and the
corresponding mainlobe widths.

FIGURE 5.8
Time-domain window functions and their frequency-domain characteristics for rectangular, Hanning,
Hamming, Kaiser, and Dolph-Chebyshev windows.

Clearly, $\beta = 0$ results in the rectangular window while $\beta > 0$ results in lower sidelobe
leakage at the expense of a wider mainlobe. Kaiser has developed approximate design equations
for $\beta$. Given a peak sidelobe level of A dB below the peak value, the approximate value of
$\beta$ is given by


$$
\beta \approx
\begin{cases}
0 & A \le 21 \\
0.5842(A-21)^{0.4} + 0.07886(A-21) & 21 < A \le 50 \\
0.1102(A-8.7) & A > 50
\end{cases}
\tag{5.1.17}
$$


Furthermore, to achieve the given values of the peak sidelobe level A and the mainlobe
width $\Delta\omega$, the length N must satisfy

$$
\Delta\omega = \frac{A-8}{2.285(N-1)}
\tag{5.1.18}
$$

In MATLAB this window is given by the function kaiser(N,beta).
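The Kaiser design relations can be sketched in a few lines of Python. This is only an illustration of (5.1.16) to (5.1.18) as printed; the function names are hypothetical, the Bessel function is evaluated with a truncated power series (an assumption made to avoid external libraries), and (5.1.18) is read as "choose the smallest N whose mainlobe width does not exceed the target".

```python
import math

def kaiser_beta(A):
    """Approximate shape parameter beta for a peak sidelobe level of A dB, per (5.1.17)."""
    if A <= 21:
        return 0.0
    if A <= 50:
        return 0.5842 * (A - 21) ** 0.4 + 0.07886 * (A - 21)
    return 0.1102 * (A - 8.7)

def kaiser_length(A, delta_omega):
    """Smallest integer N for which (A - 8) / (2.285 (N - 1)) does not exceed delta_omega."""
    return math.ceil((A - 8) / (2.285 * delta_omega)) + 1

def bessel_i0(x, terms=40):
    """Modified zero-order Bessel function from its power series sum_k [(x/2)^k / k!]^2."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2   # ratio of consecutive series terms
        total += term
    return total

def kaiser_window(N, beta):
    """Samples w_K(0), ..., w_K(N-1) of (5.1.16)."""
    w = []
    for n in range(N):
        u = 1.0 - 2.0 * n / (N - 1)
        arg = beta * math.sqrt(max(0.0, 1.0 - u * u))  # guard against tiny negative round-off
        w.append(bessel_i0(arg) / bessel_i0(beta))
    return w
```

For example, `kaiser_window(N, 0.0)` returns all ones, which is the rectangular window mentioned in the text, and `kaiser_beta(60)` falls in the third branch of (5.1.17).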


Dolph-Chebyshev window. This window is characterized by the property that the peak
sidelobe levels are constant; that is, it has an "equiripple" behavior. The window $w_{DC}(n)$ is
obtained as the inverse DFT of the Chebyshev polynomial evaluated at N equally spaced
frequencies around the unit circle. The details of this window function computation are
available in Harris (1978). The parameters of the Dolph-Chebyshev window are the constant
sidelobe level A in decibels, the window length N, and the mainlobe width $\Delta\omega$. However,
only two of the three parameters can be independently specified. In spectrum estimation,
parameters N and A are generally specified. Then $\Delta\omega$ is given by

$$
\Delta\omega = \cos^{-1}\left\{\left[\cosh\left(\frac{\cosh^{-1} 10^{A/20}}{N-1}\right)\right]^{-1}\right\}
\tag{5.1.19}
$$

In MATLAB this window is obtained through the function chebwin(N,A).
To illustrate the usefulness of these windows, consider the same signal containing
three frequencies given in (5.1.12). Figure 5.9 shows the spectrum of $x_N(n)$ using the
Hanning, Kaiser, and Chebyshev windows for length N = 81. The Kaiser and Chebyshev
window parameters are adjusted so that the peak sidelobe level is 40 dB or below. Clearly,
these windows have suppressed sidelobes considerably compared to those of the rectangular
window, but the main peaks are wider with negligible bias. The two peaks in the Hanning
window spectrum are barely resolved because the mainlobe width of this window is much
wider than that of the rectangular window. The Chebyshev window spectrum has uniform
sidelobes while the Kaiser window spectrum shows decreasing sidelobes away from the
mainlobes.

FIGURE 5.9
Spectrum of three sinusoids using Hanning, Kaiser, and
Chebyshev windows.


5.1.5 Summary 


In conclusion, the frequency analysis of deterministic signals requires a careful study of
three important steps. First, the continuous-time signal x(t) is sampled to obtain samples
x(n) that are collected into blocks or frames. The frames are "conditioned" to minimize
certain errors by multiplying by a window sequence w(n) of length N. Finally the windowed
frames $x_N(n)$ are transformed to the frequency domain using the DFT. The resulting DFT
spectrum $\tilde{X}_N(k)$ is a faithful replica of the actual spectrum $X_c(F)$ if the following errors
are sufficiently small.


Aliasing error. This is an error due to the sampling operation. If the sampling rate is 
sufficiently high and if the antialiasing filter is properly designed so that most of the 
frequencies of interest are represented in x(n), then this error can be made smaller.
However, a certain amount of aliasing should be expected. The sampling principle 
and aliasing are discussed in Section 2.2.2. 

Errors due to finite-length window. There are several errors such as resolution loss, 
bias, and leakage that are attributed to the windowing operation. Therefore, a care- 
ful design of the window function and its length is necessary to minimize these 
errors. These topics were discussed in Section 5.1.4. In Table 5.1 we summarize 
key properties of five windows discussed in this section that are useful for spectrum 
estimation. 

Spectrum reconstruction error. The DFT spectrum $\tilde{X}_N(k)$ is a number sequence that
must be reconstructed into a continuous function for the purpose of plotting. A 
practical choice for this reconstruction is the first-order polynomial interpolation. 
This reconstruction error can be made smaller (and in fact comparable to the screen 
resolution) by choosing a large number of frequency samples, which can be achieved 
by the zero padding operation in the DFT. It was discussed in Section 5.1.3. 


With the understanding of frequency analysis concepts developed in this section, we 
are now ready to tackle the problem of spectral analysis of stationary random signals. From 
Chapter 3, we recognize that the true spectral values can only be obtained as estimates. This 
requires some understanding of key concepts from estimation theory, which is developed 
in Section 3.6. 


5.2 ESTIMATION OF THE AUTOCORRELATION OF STATIONARY 
RANDOM SIGNALS 


The second-order moments of a stationary random sequence (that is, the mean value $\mu_x$,
the autocorrelation sequence $r_x(l)$, and the PSD $R_x(e^{j\omega})$) play a crucial role in signal
analysis and signal modeling. In this section, we discuss the estimation of the autocorrelation
sequence $r_x(l)$ using a finite data record $\{x(n)\}_0^{N-1}$ of the process.



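For a stationary process x(n), the most widely used estimator of $r_x(l)$ is given by the
sample autocorrelation sequence

$$
\hat{r}_x(l) \triangleq
\begin{cases}
\dfrac{1}{N}\displaystyle\sum_{n=0}^{N-l-1} x(n+l)\,x^*(n) & 0 \le l \le N-1 \\[8pt]
\hat{r}_x^*(-l) & -(N-1) \le l < 0 \\[2pt]
0 & \text{elsewhere}
\end{cases}
\tag{5.2.1}
$$

or, equivalently,

$$
\hat{r}_x(l) \triangleq
\begin{cases}
\dfrac{1}{N}\displaystyle\sum_{n=l}^{N-1} x(n)\,x^*(n-l) & 0 \le l \le N-1 \\[8pt]
\hat{r}_x^*(-l) & -(N-1) \le l < 0 \\[2pt]
0 & \text{elsewhere}
\end{cases}
\tag{5.2.2}
$$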

which is a random sequence. Note that without further information beyond the observed
data $\{x(n)\}_0^{N-1}$, it is not possible to provide reasonable estimates of $r_x(l)$ for $|l| \ge N$.
Even for lag values $|l|$ close to N, the correlation estimates are unreliable since very few
$x(n+|l|)x(n)$ pairs are used. A good rule of thumb provided by Box and Jenkins (1976) is
that N should be at least 50 and that $|l| \le N/4$. The sample autocorrelation $\hat{r}_x(l)$ given in
(5.2.1) has a desirable property that for each $l \ge 1$, the sample autocorrelation matrix

$$
\hat{\mathbf{R}}_x =
\begin{bmatrix}
\hat{r}_x(0) & \hat{r}_x^*(1) & \cdots & \hat{r}_x^*(N-1) \\
\hat{r}_x(1) & \hat{r}_x(0) & \cdots & \hat{r}_x^*(N-2) \\
\vdots & \vdots & \ddots & \vdots \\
\hat{r}_x(N-1) & \hat{r}_x(N-2) & \cdots & \hat{r}_x(0)
\end{bmatrix}
\tag{5.2.3}
$$


is nonnegative definite (see Section 3.5.1). This property is explored in Problem 5.5. MATLAB
provides functions to compute the correlation matrix $\hat{\mathbf{R}}_x$ (for example, corr), given the
data $\{x(n)\}_0^{N-1}$; however, the book toolbox function rx = autoc(x,L); computes $\hat{r}_x(l)$
according to (5.2.1) very efficiently.
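As a rough illustration of (5.2.1), the following Python sketch evaluates the sample autocorrelation directly from its definition for nonnegative lags. It is not the book's autoc function; the name `sample_autocorr` is hypothetical, and the double loop is written for clarity rather than efficiency.

```python
def sample_autocorr(x, max_lag):
    """Biased sample autocorrelation of (5.2.1) for lags l = 0, ..., max_lag.

    x may hold real or complex samples; values at negative lags follow from
    the conjugate symmetry r(-l) = conj(r(l)).
    """
    N = len(x)
    r = []
    for l in range(max_lag + 1):
        # The sum runs over the N - l available products x(n + l) x*(n),
        # but the divisor stays N, which is what makes the estimate biased.
        acc = sum(x[n + l] * x[n].conjugate() for n in range(N - l))
        r.append(acc / N)
    return r
```

For the record x = [1, 2, 3, 4] the sketch gives the lag-0 to lag-3 values 7.5, 5, 2.75, and 1; only one product contributes at lag 3, which is why estimates at lags close to N are unreliable.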

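The estimate of covariance $\gamma_x(l)$ from the data record $\{x(n)\}_0^{N-1}$ is given by the sample
autocovariance sequence

$$
\hat{\gamma}_x(l) \triangleq
\begin{cases}
\dfrac{1}{N}\displaystyle\sum_{n=0}^{N-l-1} [x(n+l) - \hat{\mu}_x][x(n) - \hat{\mu}_x]^* & 0 \le l \le N-1 \\[8pt]
\hat{\gamma}_x^*(-l) & -(N-1) \le l < 0 \\[2pt]
0 & \text{elsewhere}
\end{cases}
\tag{5.2.4}
$$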

so that the corresponding autocovariance matrix $\hat{\boldsymbol{\Gamma}}_x$ is nonnegative definite. Similarly, the
sample autocorrelation coefficient sequence $\hat{\rho}_x(l)$ is given by

$$
\hat{\rho}_x(l) = \frac{\hat{\gamma}_x(l)}{\hat{\sigma}_x^2}
\tag{5.2.5}
$$

In the rest of this section, we assume that x(n) is a zero-mean process and hence $\hat{r}_x(l) =
\hat{\gamma}_x(l)$, so that we can discuss the autocorrelation estimate in detail.

To determine the statistical quality of this estimator, we now consider its mean and 
variance. 


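Mean of $\hat{r}_x(l)$. We first note that (5.2.1) can be written as

$$
\hat{r}_x(l) = \frac{1}{N}\sum_{n=-\infty}^{\infty} x(n+l)\,w(n+l)\,x^*(n)\,w(n) \qquad |l| \ge 0
\tag{5.2.6}
$$

where

$$
w(n) = w_R(n) =
\begin{cases}
1 & 0 \le n \le N-1 \\
0 & \text{elsewhere}
\end{cases}
\tag{5.2.7}
$$

is the rectangular window. The expected value of $\hat{r}_x(l)$ is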


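$$
E\{\hat{r}_x(l)\} = \frac{1}{N}\sum_{n=-\infty}^{\infty} E\{x(n+l)\,x^*(n)\}\,w(n+l)\,w(n) \qquad l \ge 0
$$

and

$$
E\{\hat{r}_x(-l)\} = E\{\hat{r}_x^*(l)\} \qquad -l \le 0
$$

Therefore

$$
E\{\hat{r}_x(l)\} = \frac{1}{N}\, r_x(l)\, r_w(l)
\tag{5.2.8}
$$

where

$$
r_w(l) = w(l) * w(-l) = \sum_{n=-\infty}^{\infty} w(n)\,w(n+l)
\tag{5.2.9}
$$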
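is the autocorrelation of the window sequence. For the rectangular window

$$
r_w(l) = w_B(l) \triangleq
\begin{cases}
N - |l| & |l| \le N-1 \\
0 & \text{elsewhere}
\end{cases}
\tag{5.2.10}
$$

which is the unnormalized triangular or Bartlett window. Thus

$$
E\{\hat{r}_x(l)\} = \frac{1}{N}\, r_x(l)\, w_B(l) = r_x(l)\left(1 - \frac{|l|}{N}\right) \qquad |l| \le N-1
\tag{5.2.11}
$$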


Therefore, we conclude that the relation (5.2.1) provides a biased estimate of $r_x(l)$ because
the expected value of $\hat{r}_x(l)$ from (5.2.11) is not equal to the true autocorrelation $r_x(l)$.
However, $\hat{r}_x(l)$ is an asymptotically unbiased estimator since if $N \to \infty$, $E\{\hat{r}_x(l)\} \to r_x(l)$.
Clearly, the bias is small if $\hat{r}_x(l)$ is evaluated for $|l| \le L$, where L is the maximum desired
lag and $L \ll N$.


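Variance of $\hat{r}_x(l)$. An approximate expression for the covariance of $\hat{r}_x(l)$ is given by
Jenkins and Watts (1968)

$$
\operatorname{cov}\{\hat{r}_x(l_1), \hat{r}_x(l_2)\} \simeq \frac{1}{N}\sum_{l=-\infty}^{\infty}
\left[ r_x(l)\, r_x(l + l_2 - l_1) + r_x(l + l_2)\, r_x(l - l_1) \right]
\tag{5.2.12}
$$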


This indicates that successive values of $\hat{r}_x(l)$ may be highly correlated and that $\hat{r}_x(l)$ may
fail to die out even if it is expected to. This makes the interpretation of autocorrelation
graphs quite challenging because we do not know whether the variation is real or statistical.

The variance of $\hat{r}_x(l)$, which can be obtained by setting $l_1 = l_2$ in (5.2.12), tends to zero
as $N \to \infty$. Thus, $\hat{r}_x(l)$ provides a good estimate of $r_x(l)$ if the lag $|l|$ is much smaller than
N. However, as $|l|$ approaches N, fewer and fewer samples of x(n) are used to evaluate
$\hat{r}_x(l)$. As a result, the estimate $\hat{r}_x(l)$ becomes worse and its variance increases.


Nonnegative definiteness of $\hat{r}_x(l)$. An alternative estimator for the autocorrelation
sequence is given by

$$
\check{r}_x(l) =
\begin{cases}
\dfrac{1}{N-l}\displaystyle\sum_{n=0}^{N-l-1} x(n+l)\,x^*(n) & 0 \le l \le L < N \\[8pt]
\check{r}_x^*(-l) & -N < -L \le l < 0 \\[2pt]
0 & \text{elsewhere}
\end{cases}
\tag{5.2.13}
$$

Although this estimator is unbiased, it is not used in spectral estimation because it is not
guaranteed to be nonnegative definite. In contrast, the estimator $\hat{r}_x(l)$ from (5.2.1) is nonnegative definite,
and any spectral estimates based on it do not have any negative values. Furthermore, the
estimator $\hat{r}_x(l)$ has smaller variance and mean square error than the estimator $\check{r}_x(l)$ (Jenkins
and Watts 1968). Thus, in this book we use the estimator $\hat{r}_x(l)$ defined in (5.2.1).
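A small numerical sketch can make the contrast between the two estimators concrete. The code below assumes real-valued data and uses hypothetical helper names; it forms a spectrum estimate by Fourier-transforming either the biased estimate (divisor N) or the unbiased estimate (divisor N - l) over all available lags.

```python
import math

def autocorr(x, unbiased=False):
    """Autocorrelation estimates for lags 0..N-1 of a real-valued record x.

    unbiased=False divides by N (biased estimator), unbiased=True divides
    by N - l (the alternative, unbiased estimator).
    """
    N = len(x)
    r = []
    for l in range(N):
        s = sum(x[n + l] * x[n] for n in range(N - l))
        r.append(s / (N - l) if unbiased else s / N)
    return r

def spectrum_from_autocorr(r, omega):
    """Fourier transform of a real, even autocorrelation sequence given for lags 0..len(r)-1."""
    return r[0] + 2.0 * sum(r[l] * math.cos(omega * l) for l in range(1, len(r)))
```

For the three-sample record x = [1, 1, -1], the unbiased estimates are 1, 0, -1, so the resulting "spectrum" 1 - 2 cos(2ω) equals -1 at ω = 0. The biased estimates 1, 0, -1/3 give 1 - (2/3) cos(2ω), which never drops below 1/3.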


5.3 ESTIMATION OF THE POWER SPECTRUM OF STATIONARY 
RANDOM SIGNALS 


From a practical point of view, most stationary random processes have continuous spectra. 
However, harmonic processes (i.e., processes with line spectra) appear in several appli- 
cations either alone or in mixed spectra (a mixture of continuous and line spectra). We 
first discuss the estimation of continuous spectra in detail. The estimation of line spectra is 
considered in Chapter 9. 

The power spectral density of a zero-mean stationary stochastic process was defined
in (3.3.39) as

$$
R_x(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_x(l)\, e^{-j\omega l}
\tag{5.3.1}
$$

assuming that the autocorrelation sequence $r_x(l)$ is absolutely summable. We will deal
with the problem of estimating the power spectrum $R_x(e^{j\omega})$ of a stationary process x(n)
from a finite record of observations $\{x(n)\}_0^{N-1}$ of a single realization. The ideal goal is to
devise an estimate that will faithfully characterize the power-versus-frequency distribution
of the stochastic process (i.e., all the sequences of the ensemble) using only a segment of a
single realization. For this to be possible, the estimate should typically involve some kind
of averaging among several realizations or along a single realization.

In some practical applications (e.g., interferometry), it is possible to directly measure
the autocorrelation $r_x(l)$, $|l| \le L < N$, with great accuracy. In this case, the spectrum
estimation problem can be treated as a deterministic one, as described in Section 5.1. We
will focus on the "stochastic" version of the problem, where $R_x(e^{j\omega})$ is estimated from the
available data $\{x(n)\}_0^{N-1}$. A natural estimate of $R_x(e^{j\omega})$, suggested by (5.3.1), is to estimate
$r_x(l)$ from the available data and then transform it by using (5.3.1).


5.3.1 Power Spectrum Estimation Using the Periodogram 


The periodogram is an estimator of the power spectrum, introduced by Schuster (1898) in
his efforts to search for hidden periodicities in solar sunspot data. The periodogram of the
data segment $\{x(n)\}_0^{N-1}$ is defined by

$$
\hat{R}_x(e^{j\omega}) \triangleq \frac{1}{N}\left|\sum_{n=0}^{N-1} v(n)\, e^{-j\omega n}\right|^2
= \frac{1}{N}\left|V(e^{j\omega})\right|^2
\tag{5.3.2}
$$

where $V(e^{j\omega})$ is the DTFT of the windowed sequence

$$
v(n) = x(n)\,w(n) \qquad 0 \le n \le N-1
\tag{5.3.3}
$$


The above definition of the periodogram stems from Parseval's relation (2.2.10) on the
power of a signal. The window w(n), which has length N, is known as the data window.
Usually, the term periodogram is used when w(n) is a rectangular window. In contrast, the
term modified periodogram is used to stress the use of nonrectangular windows. The values
of the periodogram at the discrete set of frequencies $\{\omega_k = 2\pi k/N\}_0^{N-1}$ can be calculated
by

$$
\tilde{R}_x(k) \triangleq \hat{R}_x(e^{j2\pi k/N}) = \frac{1}{N}\left|\tilde{V}(k)\right|^2 \qquad k = 0, 1, \ldots, N-1
\tag{5.3.4}
$$

where $\tilde{V}(k)$ is the N-point DFT of the windowed segment v(n). In MATLAB, the modified
periodogram computation is implemented by using the function


Rx = psd(x,Nfft,Fs,window(N),'none');


where window is the name of any MATLAB-provided window function (e.g., hamming); Nfft
is the size of the DFT, which is chosen to be larger than N to obtain a high-density spectrum
(see zero padding in Section 5.1.1); and Fs is the sampling frequency, which is used for
plotting purposes. If the window boxcar is used, then we obtain the periodogram estimate.
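For readers without MATLAB, the computation in (5.3.2) to (5.3.4) can be sketched in plain Python. The function below is an illustrative stand-in, not an equivalent of psd: its name and arguments are hypothetical, it evaluates the DFT by direct summation (slow but transparent), and it keeps the 1/N normalization of (5.3.2) whatever window is supplied.

```python
import cmath
import math

def periodogram(x, window=None, nfft=None):
    """Periodogram samples at the frequencies 2*pi*k/nfft, k = 0..nfft-1.

    window : optional list of N data-window samples (rectangular if omitted)
    nfft   : DFT size; values larger than N correspond to zero padding
    """
    N = len(x)
    w = window if window is not None else [1.0] * N
    nfft = nfft or N
    v = [xi * wi for xi, wi in zip(x, w)]          # windowed segment v(n) = x(n) w(n)
    P = []
    for k in range(nfft):
        Vk = sum(v[n] * cmath.exp(-2j * math.pi * k * n / nfft) for n in range(N))
        P.append(abs(Vk) ** 2 / N)                 # (1/N) |V(k)|^2
    return P
```

Calling `periodogram(x, nfft=4 * len(x))` samples the same underlying function of frequency on a four-times denser grid; it adds no new information about the data.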

The periodogram can be expressed in terms of the autocorrelation estimate $\hat{r}_v(l)$ of the
windowed sequence v(n) as (see Problem 5.9)

$$
\hat{R}_x(e^{j\omega}) = \sum_{l=-(N-1)}^{N-1} \hat{r}_v(l)\, e^{-j\omega l}
\tag{5.3.5}
$$


which shows that $\hat{R}_x(e^{j\omega})$ is a "natural" estimate of the power spectrum. From (5.3.2)
it follows that $\hat{R}_x(e^{j\omega})$ is nonnegative for all frequencies $\omega$. This results from the fact
that the autocorrelation sequence $\hat{r}_v(l)$, $0 \le |l| \le N-1$, is nonnegative definite. If we
use the estimate $\check{r}_x(l)$ from (5.2.13) in (5.3.5) instead of $\hat{r}_x(l)$, the obtained periodogram
may assume negative values, which implies that $\check{r}_x(l)$ is not guaranteed to be nonnegative
definite.

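The inverse Fourier transform of $\hat{R}_x(e^{j\omega})$ provides the estimated autocorrelation $\hat{r}_v(l)$,
that is,

$$
\hat{r}_v(l) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \hat{R}_x(e^{j\omega})\, e^{j\omega l}\, d\omega
\tag{5.3.6}
$$

because $\hat{r}_v(l)$ and $\hat{R}_x(e^{j\omega})$ form a DTFT pair. Using (5.3.6) and (5.2.1) for $l = 0$, we have

$$
\hat{r}_v(0) = \frac{1}{N}\sum_{n=0}^{N-1} |v(n)|^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} \hat{R}_x(e^{j\omega})\, d\omega
\tag{5.3.7}
$$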


Thus, the periodogram $\hat{R}_x(e^{j\omega})$ shows how the power of the segment $\{v(n)\}_0^{N-1}$, which
provides an estimate of the variance of the process x(n), is distributed as a function of
frequency.
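The equivalence between the two forms of the periodogram, (5.3.2) and (5.3.5), is easy to check numerically. The sketch below (with hypothetical function names) evaluates both at an arbitrary frequency for a short windowed segment v(n); it is meant as a sanity check on the algebra, not as an efficient implementation.

```python
import cmath

def periodogram_at(v, omega):
    """(1/N) |V(e^{j omega})|^2, the direct form (5.3.2)."""
    N = len(v)
    V = sum(v[n] * cmath.exp(-1j * omega * n) for n in range(N))
    return abs(V) ** 2 / N

def autocorr_biased(v, l):
    """Biased autocorrelation estimate of v at lag l (negative lags by conjugate symmetry)."""
    N = len(v)
    if l < 0:
        return autocorr_biased(v, -l).conjugate()
    return sum(v[n + l] * v[n].conjugate() for n in range(N - l)) / N

def periodogram_from_autocorr(v, omega):
    """Sum over lags -(N-1)..(N-1) of r_v(l) e^{-j omega l}, the form (5.3.5)."""
    N = len(v)
    total = sum(autocorr_biased(v, l) * cmath.exp(-1j * omega * l)
                for l in range(-(N - 1), N))
    return total.real   # the imaginary part is zero up to round-off
```

Both routes return the same nonnegative number at any frequency, for real or complex v, which is what (5.3.5) asserts.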


Filter bank interpretation. The above assertion that the periodogram describes a
distribution of power as a function of frequency can be interpreted in a different way, in which
the power estimate over a narrow frequency band is attributed to the output power of a
narrow-bandpass filter. This leads to the well-known filter bank interpretation of the
periodogram. To develop this interpretation, consider the basic (unwindowed) periodogram
estimator $\hat{R}_x(e^{j\omega})$ in (5.3.2), evaluated at a frequency $\omega_k \triangleq k\,\Delta\omega = 2\pi k/N$, which can be
expressed as

$$
\begin{aligned}
\hat{R}_x(e^{j\omega_k})
&= \frac{1}{N}\left|\sum_{n=0}^{N-1} x(n)\, e^{-j\omega_k n}\right|^2
 = \frac{1}{N}\left|\sum_{n=0}^{N-1} x(n)\, e^{j2\pi k - j\omega_k n}\right|^2 \\
&= \frac{1}{N}\left|\sum_{n=0}^{N-1} x(n)\, e^{j\omega_k (N-n)}\right|^2 \qquad \text{since } \omega_k N = 2\pi k \\
&= \frac{1}{N}\left|\sum_{m=0}^{N-1} x(N-m)\, e^{j\omega_k m}\right|^2
\end{aligned}
\tag{5.3.8}
$$

which is the convolution of x(n) and $e^{j\omega_k n}$, evaluated at $n = N$. Define

$$
h_k(n) \triangleq
\begin{cases}
\dfrac{1}{N}\, e^{j\omega_k n} & 0 \le n \le N-1 \\[6pt]
0 & \text{otherwise}
\end{cases}
\tag{5.3.9}
$$


as the impulse response of a linear system whose frequency response is given by

$$
\begin{aligned}
H_k(e^{j\omega}) = \mathcal{F}[h_k(n)]
&= \sum_{n=0}^{N-1} h_k(n)\, e^{-j\omega n}
 = \frac{1}{N}\sum_{n=0}^{N-1} e^{-j(\omega-\omega_k)n}
 = \frac{1}{N}\,\frac{e^{-jN(\omega-\omega_k)} - 1}{e^{-j(\omega-\omega_k)} - 1} \\
&= \frac{1}{N}\,\frac{\sin[N(\omega-\omega_k)/2]}{\sin[(\omega-\omega_k)/2]}\; e^{-j(N-1)(\omega-\omega_k)/2}
\end{aligned}
\tag{5.3.10}
$$

which is a linear-phase, narrow-bandpass filter centered at $\omega = \omega_k$. The 3-dB bandwidth
of this filter is proportional to $2\pi/N$ rad per sampling interval (or 1/N cycles per sampling
interval). A plot of the magnitude response $|H_k(e^{j\omega})|$, for $\omega_k = \pi/2$ and N = 50, is shown
in Figure 5.10, which evidently shows the narrowband nature of the filter.


FIGURE 5.10
The magnitude of the frequency response of the narrow-bandpass
filter for $\omega_k = \pi/2$ and N = 50.


Continuing, we also define the output of the filter $h_k(n)$ by $y_k(n)$, that is,

$$
y_k(n) \triangleq h_k(n) * x(n) = \frac{1}{N}\sum_{m=0}^{N-1} x(n-m)\, e^{j\omega_k m}
\tag{5.3.11}
$$

Then (5.3.8) can be written as

$$
\hat{R}_x(e^{j\omega_k}) = N\,|y_k(N)|^2
\tag{5.3.12}
$$


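Now consider the average power in $y_k(n)$, which can be evaluated using the spectral density
as [see (3.3.45) and (3.4.22)]

$$
E\{|y_k(n)|^2\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} R_x(e^{j\omega})\,|H_k(e^{j\omega})|^2\, d\omega
\approx \frac{\Delta\omega}{2\pi}\, R_x(e^{j\omega_k}) = \frac{1}{N}\, R_x(e^{j\omega_k})
\tag{5.3.13}
$$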


since $H_k(e^{j\omega})$ is a narrowband filter. If we estimate the average power $E\{|y_k(n)|^2\}$ using one
sample $y_k(N)$, then from (5.3.13) the estimated spectral density is the periodogram given
by (5.3.12), which says that the kth DFT sample of the periodogram [see (5.3.4)] is given
by the average power of a single Nth output sample of the $\omega_k$-centered narrow-bandpass
filter. Now imagine one such filter for each of the frequencies $\omega_k$, $k = 0, \ldots, N-1$. Thus we
have a bank of filters, each tuned to the discrete frequency (based on the data record length),
providing the periodogram estimates every N samples. This filter bank is inherently built
into the periodogram and hence need not be explicitly implemented. The block diagram of
this filter bank approach to the periodogram computation is shown in Figure 5.11.
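The filter bank reading of the periodogram can be checked with a short Python sketch. Two choices here are assumptions made for the illustration: the data are taken to be zero outside 0..N-1, and the filter output is sampled at n = N - 1 (the instant marked in Figure 5.11) so that exactly the samples x(0), ..., x(N-1) enter the sum. With that indexing the output differs from the DFT only by a unit-modulus phase factor. All function names are hypothetical.

```python
import cmath
import math

def filter_response(N, k, omega):
    """H_k(e^{j omega}) = (1/N) sum_n e^{-j (omega - omega_k) n}, with omega_k = 2 pi k / N."""
    wk = 2.0 * math.pi * k / N
    return sum(cmath.exp(-1j * (omega - wk) * n) for n in range(N)) / N

def filter_output(x, k, n):
    """y_k(n) = (1/N) sum_{m=0}^{N-1} x(n - m) e^{j omega_k m}, taking x = 0 outside 0..N-1."""
    N = len(x)
    wk = 2.0 * math.pi * k / N
    acc = 0.0
    for m in range(N):
        if 0 <= n - m < N:
            acc += x[n - m] * cmath.exp(1j * wk * m)
    return acc / N

def periodogram_bin(x, k):
    """(1/N) |sum_n x(n) e^{-j omega_k n}|^2, the k-th sample of the unwindowed periodogram."""
    N = len(x)
    X = sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
    return abs(X) ** 2 / N
```

For any record, `len(x) * abs(filter_output(x, k, len(x) - 1)) ** 2` agrees with `periodogram_bin(x, k)`. The response of filter k is 1 at its own center frequency and 0 at the center frequencies of the other N - 1 filters.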


FIGURE 5.11
The filter bank approach to the periodogram computation.


In Section 5.1, we observed that the periodogram of a deterministic signal approaches
the true energy spectrum as the number of observations $N \to \infty$. To see how the power
spectrum of random signals is related to the number of observations, we consider the following
example.


EXAMPLE 5.3.1 (PERIODOGRAM OF A SIMULATED WHITE NOISE SEQUENCE). Let x(n)
be a stationary white Gaussian noise with zero mean and unit variance. The theoretical spectrum
of x(n) is

$$
R_x(e^{j\omega}) = \sigma_x^2 = 1 \qquad -\pi < \omega \le \pi
$$

To study the periodogram estimate, 50 different N-point records of x(n) were generated using a
pseudorandom number generator. The periodogram $\hat{R}_x(e^{j\omega})$ of each record was computed for
$\omega = \omega_k = 2\pi k/1024$, $k = 0, 1, \ldots, 512$, that is, with $N_{\mathrm{FFT}} = 1024$, from the available data
using (5.3.4) for N = 32, 128, and 256. These results in the form of periodogram overlays (a
Monte Carlo simulation) and their averages are shown in Figure 5.12. We notice that $\hat{R}_x(e^{j\omega})$
fluctuates so erratically that it is impossible to conclude from its observation that the signal has a
flat spectrum. Furthermore, the size of the fluctuations (as seen from the ensemble average) is not
reduced by increasing the segment length N. In this sense, we should not expect the periodogram
$\hat{R}_x(e^{j\omega})$ to converge to the true spectrum $R_x(e^{j\omega})$ in some statistical sense as $N \to \infty$. Since
$R_x(e^{j\omega})$ is constant over frequency, the fluctuations of $\hat{R}_x(e^{j\omega})$ can be characterized by their
mean, variance, and mean square error over frequency for each N and are given in Table 5.2. It
can be seen that although the mean value tends to 1 (true value), the standard deviation is not
reduced as N increases. In fact, it is close to 1; that is, it is of the order of the size of the quantity to
be estimated. This illustrates that the periodogram is not a good estimate of the power spectrum.
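A scaled-down version of this experiment can be run with the Python standard library alone. The sketch below is not a reproduction of Table 5.2: it uses its own pseudorandom records, evaluates the periodogram only at the N-point DFT frequencies strictly between 0 and π (no zero padding), and pools the values over frequency and over records. Function names, the seed, and the number of records are arbitrary choices for the illustration.

```python
import cmath
import math
import random

def periodogram_bins(x):
    """Unwindowed periodogram at omega_k = 2 pi k / N for k = 1, ..., N/2 - 1."""
    N = len(x)
    out = []
    for k in range(1, N // 2):
        X = sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        out.append(abs(X) ** 2 / N)
    return out

def monte_carlo(N, records=50, seed=0):
    """Pooled sample mean and variance of the periodogram of unit-variance white Gaussian noise."""
    rng = random.Random(seed)
    values = []
    for _ in range(records):
        x = [rng.gauss(0.0, 1.0) for _ in range(N)]
        values.extend(periodogram_bins(x))
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var
```

Running `monte_carlo(32)` and `monte_carlo(64)` should give a pooled mean near 1 and a pooled variance that also stays near 1 rather than shrinking as N grows, in line with the conclusion of the example.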


Since for each value of $\omega$, $\hat{R}_x(e^{j\omega})$ is a random variable, the erratic behavior of the
periodogram estimator, which is illustrated in Figure 5.12, can be explained by considering
its mean, covariance, and variance.

TABLE 5.2
Performance of periodogram for white Gaussian
noise signal in Example 5.3.1.

N                                 32        128       256
$E[\hat{R}_x(e^{j\omega_k})]$     0.7829    0.8954    0.9963
$\mathrm{var}[\hat{R}_x(e^{j\omega_k})]$   0.7232    1.0635    1.1762
MSE                               0.7689    1.07244   1.1739

FIGURE 5.12
Periodograms of white Gaussian noise in Example 5.3.1 (panels: periodogram overlay and
periodogram average for each N).


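Mean of $\hat{R}_x(e^{j\omega})$. Taking the mathematical expectation of (5.3.5) and using (5.2.8),
we obtain

$$
E\{\hat{R}_x(e^{j\omega})\} = \sum_{l=-(N-1)}^{N-1} E\{\hat{r}_x(l)\}\, e^{-j\omega l}
= \frac{1}{N}\sum_{l=-(N-1)}^{N-1} r_x(l)\, r_w(l)\, e^{-j\omega l}
\tag{5.3.14}
$$

Since $E\{\hat{R}_x(e^{j\omega})\} \ne R_x(e^{j\omega})$, the periodogram is a biased estimate of the true power
spectrum $R_x(e^{j\omega})$.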


Equation (5.3.14) can be interpreted in the frequency domain as a periodic convolution.
Indeed, using the frequency domain convolution theorem, we have

$$
E\{\hat{R}_x(e^{j\omega})\} = \frac{1}{2\pi N}\int_{-\pi}^{\pi} R_x(e^{j\theta})\, R_w(e^{j(\omega-\theta)})\, d\theta
\tag{5.3.15}
$$

where

$$
R_w(e^{j\omega}) = |W(e^{j\omega})|^2
\tag{5.3.16}
$$

is the spectrum of the window. Thus, the expected value of the periodogram is obtained by
convolving the true spectrum $R_x(e^{j\omega})$ with the spectrum $R_w(e^{j\omega})$ of the window. This is
equivalent to windowing the true autocorrelation $r_x(l)$ with the correlation or lag window
$r_w(l) = w(l) * w(-l)$, where w(n) is the data window.

To understand the implications of (5.3.15), consider the rectangular data window
(5.2.7). Using (5.2.11), we see that (5.3.14) becomes

$$
E\{\hat{R}_x(e^{j\omega})\} = \sum_{l=-(N-1)}^{N-1} \left(1 - \frac{|l|}{N}\right) r_x(l)\, e^{-j\omega l}
\tag{5.3.17}
$$

For nonperiodic autocorrelations, the value of $r_x(l)$ becomes negligible for large values of
$|l|$. Hence, as the record length N increases, the term $(1 - |l|/N) \to 1$ for all $l$, which
implies that

$$
\lim_{N\to\infty} E\{\hat{R}_x(e^{j\omega})\} = R_x(e^{j\omega})
\tag{5.3.18}
$$


that is, the periodogram is an asymptotically unbiased estimator of $R_x(e^{j\omega})$. In the frequency
domain, we obtain

$$
R_w(e^{j\omega}) = \mathcal{F}\{w_R(l) * w_R(-l)\} = |W_R(e^{j\omega})|^2 = \left[\frac{\sin(\omega N/2)}{\sin(\omega/2)}\right]^2
\tag{5.3.19}
$$

where

$$
W_R(e^{j\omega}) = e^{-j\omega(N-1)/2}\,\frac{\sin(\omega N/2)}{\sin(\omega/2)}
\tag{5.3.20}
$$

is the Fourier transform of the rectangular window. The spectrum $R_w(e^{j\omega})$, in (5.3.19), of
the correlation window $r_w(l)$ approaches a periodic impulse train as the window length
increases.† As a result, $E\{\hat{R}_x(e^{j\omega})\}$ approaches the true power spectrum $R_x(e^{j\omega})$ as N
approaches $\infty$.


The result (5.3.18) holds for any window that satisfies the following two conditions: 


1. The window is normalized such that

$$
\sum_{n=0}^{N-1} |w(n)|^2 = N
\tag{5.3.21}
$$

This condition is obtained by noting that, for asymptotic unbiasedness, we want $R_w(e^{j\omega})/N$
in (5.3.15) to be an approximation of an impulse in the frequency domain. Since the
area under the impulse function is unity, using (5.3.16) and Parseval's theorem, we have

$$
\frac{1}{2\pi N}\int_{-\pi}^{\pi} |W(e^{j\omega})|^2\, d\omega = \frac{1}{N}\sum_{n=0}^{N-1} |w(n)|^2 = 1
\tag{5.3.22}
$$

2. The width of the mainlobe of the spectrum $R_w(e^{j\omega})$ of the correlation window decreases
as 1/N. This condition guarantees that the area under $R_w(e^{j\omega})$ is concentrated at the
origin as N becomes large. For more precise conditions see Brockwell and Davis (1991).

† This spectrum is sometimes referred to as the Fejer kernel.

The bias is introduced by the sidelobes of the correlation window through leakage,
as illustrated in Section 5.1. Therefore, we can reduce the bias by using the modified
periodogram and a "better" window. Bias can be avoided if either $N = \infty$, in which case
the spectrum of the window is a periodic train of impulses, or $R_x(e^{j\omega}) = \sigma_x^2$, that is, x(n)
has a flat power spectrum. Thus, for white noise, $\hat{R}_x(e^{j\omega})$ is unbiased for all N. This fact was
apparent in Example 5.3.1 and is very important for practical applications. In the following
example, we illustrate that the bias becomes worse as the dynamic range of the spectrum
increases.


EXAMPLE 5.3.2 (BIAS AND LEAKAGE PROPERTIES OF THE PERIODOGRAM). Consider
an AR(2) process with

$$
\mathbf{a}_2 = [1 \;\; -0.75 \;\; 0.5]^T \qquad d_0 = 1
\tag{5.3.23}
$$

and an AR(4) process with

$$
\mathbf{a}_4 = [1 \;\; -2.7607 \;\; 3.8106 \;\; -2.6535 \;\; 0.9238]^T \qquad d_0 = 1
\tag{5.3.24}
$$

where $w(n) \sim \mathrm{WN}(0, 1)$. Both processes have been used extensively in the literature for power
spectrum estimation studies (Percival and Walden 1993). Their power spectrum is given by (see
Chapter 4)

$$
R_x(e^{j\omega}) = \frac{\sigma_w^2\, d_0^2}{|A(e^{j\omega})|^2}
= \frac{\sigma_w^2}{\left|\displaystyle\sum_{k=0}^{p} a_k\, e^{-j\omega k}\right|^2}
\tag{5.3.25}
$$

For simulation purposes, N = 1024 samples of each process were generated. The sample
realizations and the shapes of the two power spectra in (5.3.25) are shown in Figure 5.13. The
dynamic range of the two spectra, that is, $\max_\omega R_x(e^{j\omega}) / \min_\omega R_x(e^{j\omega})$, is about 15 and 65 dB,
respectively.

From the sample realizations, periodograms and modified periodograms, based on the
Hanning window, were computed by using (5.3.4) at $N_{\mathrm{FFT}} = 1024$ frequencies. These are shown in
Figure 5.14. The periodograms for the AR(2) and AR(4) processes, respectively, are shown in the
top row while the modified periodograms for the same processes are shown in the bottom row.
These plots illustrate that the periodogram is a biased estimator of the power spectrum. In the
case of the AR(2) process, since the spectrum has a small dynamic range (15 dB), the bias in the
periodogram estimate is not obvious; furthermore, the windowing in the modified periodogram
did not show much improvement. On the other hand, the AR(4) spectrum has a large dynamic
range, and hence the bias is clearly visible at high frequencies. This bias is clearly reduced by
windowing of the data in the modified periodogram. In both cases, the random fluctuations are
not reduced by the data windowing operation.

FIGURE 5.13
Sample realizations and power spectra of the AR(2) and AR(4) processes used in Example 5.3.2.

FIGURE 5.14
Illustration of properties of periodogram as a power spectrum estimator.


EXAMPLE 5.3.3 (FREQUENCY RESOLUTION PROPERTY OF THE PERIODOGRAM).
Consider two unit-amplitude sinusoids observed in unit-variance white noise. Let

$$
x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + v(n)
$$

where $\phi_1$ and $\phi_2$ are jointly independent random variables uniformly distributed over $[-\pi, \pi]$
and v(n) is a unit-variance white noise. Since the two frequencies, $0.35\pi$ and $0.4\pi$, are close, we
will need (see Table 5.1)

$$
N - 1 > \frac{1.81\pi}{0.4\pi - 0.35\pi} \qquad \text{or} \qquad N > 37
$$

To obtain a periodogram ensemble, 50 realizations of x(n) for N = 32 and N = 64 were
generated, and their periodograms were computed. The plots of these periodogram overlays and
the corresponding ensemble average for N = 32 and N = 64 are shown in Figure 5.15. For
N = 32, frequencies in the periodogram cannot be resolved, as expected; but for N = 64 it is
possible to separate the two sinusoids with ease. Note that the modified periodogram (i.e., data
windowing) will not help since windowing increases smoothing and smearing of peaks.


The case of nonzero mean. In the periodogram method of spectrum analysis in this
section, we assumed that the random signal has zero mean. If a random signal has nonzero
mean, it should be estimated using (3.6.20) and then removed from the signal prior to
computing its periodogram. This is because the power spectrum of a nonzero mean signal
has an impulse at the zero frequency. If this mean is relatively large, then because of the
leakage inherent in the periodogram, this mean will obscure low-amplitude, low-frequency
components of the spectrum. Even though the estimate is not an exact value, its removal
often provides better estimates, especially at low frequencies.


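Covariance of $\hat{R}_x(e^{j\omega})$. Obtaining an expression for the covariance of the periodogram
is a rather complicated process. However, it has been shown (Jenkins and Watts 1968) that

$$
\operatorname{cov}\{\hat{R}_x(e^{j\omega_1}), \hat{R}_x(e^{j\omega_2})\} \simeq
R_x(e^{j\omega_1})\, R_x(e^{j\omega_2})
\left\{
\left[\frac{\sin[(\omega_1+\omega_2)N/2]}{N\sin[(\omega_1+\omega_2)/2]}\right]^2
+ \left[\frac{\sin[(\omega_1-\omega_2)N/2]}{N\sin[(\omega_1-\omega_2)/2]}\right]^2
\right\}
\tag{5.3.26}
$$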

This expression applies to stationary random signals with zero mean and Gaussian prob- 
ability density. The approximation becomes exact if the signal has a flat spectrum (white 
noise). Although this approximation deteriorates for non-Gaussian probability densities, 
the qualitative results that one can draw from this approximation appear to hold for a rather 
broad range of densities. 

From (5.3.26), for $\omega_1 = (2\pi/N)k_1$ and $\omega_2 = (2\pi/N)k_2$ with $k_1, k_2$ integers, we have

$$
\operatorname{cov}\{\hat{R}_x(e^{j\omega_1}), \hat{R}_x(e^{j\omega_2})\} \simeq 0 \qquad \text{for } k_1 \ne k_2
\tag{5.3.27}
$$

Thus, values of the periodogram spaced in frequency by integer multiples of $2\pi/N$ are
approximately uncorrelated. As the record length N increases, these uncorrelated periodogram
samples come closer together, and hence the rate of fluctuations in the periodogram
increases. This explains the results in Figure 5.12.

FIGURE 5.15
Illustration of the frequency resolution property of the periodogram in Example 5.3.3.


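Variance of $\hat{R}_x(e^{j\omega})$. The variance of the periodogram at a particular frequency $\omega =
\omega_1 = \omega_2$ can be obtained from (5.3.26)

$$
\operatorname{var}\{\hat{R}_x(e^{j\omega})\} \simeq R_x^2(e^{j\omega})\left[1 + \left(\frac{\sin \omega N}{N\sin\omega}\right)^2\right]
\tag{5.3.28}
$$

For large values of N, the variance of $\hat{R}_x(e^{j\omega})$ can be approximated by

$$
\operatorname{var}\{\hat{R}_x(e^{j\omega})\} \simeq
\begin{cases}
R_x^2(e^{j\omega}) & 0 < \omega < \pi \\
2R_x^2(e^{j\omega}) & \omega = 0, \pi
\end{cases}
\tag{5.3.29}
$$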


This result is crucial, because it shows that the variance of the periodogram (estimate)
remains at the level of $R_x^2(e^{j\omega})$ (quantity to be estimated), independent of the record length
N used. Furthermore, since the variance does not tend to zero as $N \to \infty$, the periodogram
is not a consistent estimator; that is, its distribution does not tend to cluster more closely
around the true spectrum as N increases.‡

This behavior was illustrated in Example 5.3.1. The variance of $\hat{R}_x(e^{j\omega_k})$ fails to
decrease as N increases because the number of periodogram values $\hat{R}_x(e^{j\omega_k})$, $k = 0, 1, \ldots,
N-1$, is always equal to the length N of the data record.
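The bracketed factor in the Jenkins and Watts approximation (5.3.26) can be tabulated directly. The Python sketch below (hypothetical function names) returns that factor, so the approximate covariance is the product of the two true spectral values and `covariance_factor(w1, w2, N)`. Where the denominator sine vanishes, the ratio is replaced by its limiting squared value of 1; this is a choice made in the sketch to handle the 0/0 points.

```python
import math

def squared_kernel(theta, N):
    """[sin(N theta / 2) / (N sin(theta / 2))]^2, set to its limit 1 where sin(theta / 2) = 0."""
    s = math.sin(theta / 2.0)
    if abs(s) < 1e-12:
        return 1.0
    return (math.sin(N * theta / 2.0) / (N * s)) ** 2

def covariance_factor(w1, w2, N):
    """Term multiplying R_x(w1) R_x(w2) in the approximate covariance (5.3.26)."""
    return squared_kernel(w1 + w2, N) + squared_kernel(w1 - w2, N)
```

With N = 64, the factor is essentially zero for two distinct DFT frequencies such as 2π·5/64 and 2π·9/64, which is (5.3.27). Setting both arguments to the same frequency gives the variance factor of (5.3.28): 2 at ω = 0 or π, and a value that tends to 1 for other frequencies as N grows, as in (5.3.29).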


EXAMPLE 5.3.4 (COMPARISON OF PERIODOGRAM AND MODIFIED PERIODOGRAM).
Consider the case of three sinusoids discussed in Section 5.1.4. In particular, we assume that
these sinusoids are observed in white noise with

$$
x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + 0.25\cos(0.8\pi n + \phi_3) + v(n)
$$

where $\phi_1$, $\phi_2$, and $\phi_3$ are jointly independent random variables uniformly distributed over
$[-\pi, \pi]$ and v(n) is a unit-variance white noise. An ensemble of 50 realizations of x(n) was
generated using N = 128. The periodograms and the Hamming window-based modified
periodograms of these realizations were computed, and the results are shown in Figure 5.16. The
top row of the figure contains periodogram overlays and the corresponding ensemble average
for the unwindowed periodogram, and the bottom row shows the same for the modified
periodogram. Spurious peaks (especially near the two close frequencies) in the periodogram have
been suppressed by the data windowing operation in the modified periodogram; hence the peak
corresponding to $0.8\pi$ is sufficiently enhanced. This enhancement is clearly at the expense of
the frequency resolution (or smearing of the true peaks), which is to be expected. The overall
variance of the noise floor is still not reduced.


Failure of the periodogram 


To conclude, we note that the periodogram in its “basic form” is a very poor estimator 
of the power spectrum function. The failure of the periodogram when applied to random 
signals is uniquely pointed out in Jenkins and Watts (1968, p. 213): 


The basic reason why Fourier analysis breaks down when applied to time series is that it is based 
on the assumption of fixed amplitudes, frequencies and phases. Time series, on the other hand, 
are characterized by random changes of frequencies, amplitudes and phases. Therefore it is not 
surprising that Fourier methods need to be adapted to account for the random nature of a time 
series. 


The attempt at improving the periodogram by windowing the available data, that is, by
using the modified periodogram in Example 5.3.4, showed that the presence and the length of
the window had no effect on the variance. The major problems with the periodogram lie in its
variance, which is on the order of $R_x^2(e^{j\omega})$, as well as in its erratic behavior. Thus, to obtain a
better estimator, we should reduce its variance; that is, we should "smooth" the periodogram.


‡ The definition of the PSD by $R_x(e^{j\omega}) = \lim_{N\to\infty} \hat{R}_x(e^{j\omega})$ is not valid because even if
$\lim_{N\to\infty} E\{\hat{R}_x(e^{j\omega})\} = R_x(e^{j\omega})$, the variance of $\hat{R}_x(e^{j\omega})$ does not tend to zero as
$N \to \infty$ (Papoulis 1991).

From the previous discussion, it follows that the sequence $\tilde{R}_x(k)$, $k = 0, 1, \ldots, N-1$,
of the harmonic periodogram components can be reasonably assumed to be a sequence of
uncorrelated random variables. Furthermore, it is well known that the variance of the average
of K uncorrelated random variables with the same variance is 1/K times the variance of
one of these individual random variables. This suggests two ways of reducing the variance,
which also lead to smoother spectral estimators:


• Average contiguous values of the periodogram.
• Average periodograms obtained from multiple data segments.


It should be apparent that owing to stationarity, the two approaches should provide compa- 
rable results under similar circumstances. 
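
As a brief check of the $1/K$ factor, the following sketch assumes that the quantity being formed is the average of $K$ uncorrelated random variables $y_1, \ldots, y_K$ with common variance $\sigma^2$ (these symbols are introduced here only for illustration and do not appear elsewhere in the text):

$$\operatorname{var}\left\{\frac{1}{K}\sum_{i=1}^{K} y_i\right\} = \frac{1}{K^2}\sum_{i=1}^{K}\operatorname{var}\{y_i\} = \frac{\sigma^2}{K}$$

The cross-covariance terms vanish because the variables are uncorrelated, which is why both averaging strategies rely on approximately uncorrelated periodogram values.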


5.3.2 Power Spectrum Estimation by Smoothing a Single Periodogram— 
The Blackman-Tukey Method 


The idea of reducing the variance of the periodogram through smoothing using a moving-average
filter was first proposed by Daniell (1946). The estimator proposed by Daniell is a
zero-phase moving-average filter, given by


$$\hat{R}_x^{(PS)}(e^{j\omega_k}) \triangleq \frac{1}{2M+1} \sum_{j=-M}^{M} \hat{R}_x(e^{j\omega_{k-j}}) = \sum_{j=-M}^{M} W(e^{j\omega_j})\, \hat{R}_x(e^{j\omega_{k-j}}) \tag{5.3.30}$$

where $\omega_k = (2\pi/N)k$, $k = 0, 1, \ldots, N-1$, $W(e^{j\omega_j}) \triangleq 1/(2M+1)$, and the superscript
(PS) denotes periodogram smoothing. Since the samples of the periodogram are approximately
uncorrelated,

$$\operatorname{var}\{\hat{R}_x^{(PS)}(e^{j\omega_k})\} \simeq \frac{1}{2M+1} \operatorname{var}\{\hat{R}_x(e^{j\omega_k})\} \tag{5.3.31}$$

that is, averaging $2M+1$ consecutive spectral lines reduces the variance by a factor of
$2M+1$. The quantity $\Delta\omega \triangleq (2\pi/N)(2M+1)$ determines the frequency resolution, since
any peaks within the $\Delta\omega$ range are smoothed over the entire interval $\Delta\omega$ into a single peak
and cannot be resolved. Thus, increasing $M$ reduces the variance (resulting in a smoother
spectrum estimate), at the expense of spectral resolution. This is the fundamental tradeoff
in practical spectral analysis.
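
The moving-average smoother of (5.3.30) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`periodogram`, `daniell_smooth`, `sample_variance`) are hypothetical, the DFT is evaluated directly rather than with an FFT, the periodogram is treated as periodic so the average wraps around modulo $N$, and the test signal is synthetic white noise.

```python
import cmath
import math
import random


def periodogram(x):
    """Periodogram at the N DFT frequencies w_k = 2*pi*k/N (direct O(N^2) DFT)."""
    n_samples = len(x)
    values = []
    for k in range(n_samples):
        acc = sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n in range(n_samples))
        values.append(abs(acc) ** 2 / n_samples)
    return values


def daniell_smooth(per, m):
    """Average 2*m + 1 neighboring periodogram values, as in (5.3.30).

    Indices wrap around modulo N because the periodogram is periodic.
    """
    n_bins = len(per)
    width = 2 * m + 1
    return [sum(per[(k - j) % n_bins] for j in range(-m, m + 1)) / width
            for k in range(n_bins)]


def sample_variance(values):
    """Variance of the values across frequency bins."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


random.seed(0)  # fixed seed: illustrative data only
x = [random.gauss(0.0, 1.0) for _ in range(128)]  # unit-variance white noise
raw = periodogram(x)
smooth = daniell_smooth(raw, m=4)  # averages 9 spectral lines

# The spread across frequency shrinks after smoothing.
print(sample_variance(raw), sample_variance(smooth))
```

For white noise the true spectrum is flat, so the spread of the estimate across frequency bins serves as a rough proxy for its variance; larger values of `m` flatten the estimate further while merging nearby spectral lines.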


Blackman-Tukey approach 

The discrete moving average in (5.3.30) is computed in the frequency domain. We now 
introduce a better and simpler way to smooth the periodogram by operating on the estimated 
autocorrelation sequence. To this end, we note that the continuous frequency equivalent of 
the discrete convolution formula (5.3.30) is the periodic convolution 


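$$\hat{R}_x^{(PS)}(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \hat{R}_x(e^{j(\omega-\theta)})\, W_a(e^{j\theta})\, d\theta \triangleq \hat{R}_x(e^{j\omega}) \circledast W_a(e^{j\omega}) \tag{5.3.32}$$

where $W_a(e^{j\omega})$ is a periodic function of $\omega$ with period $2\pi$, given by

$$W_a(e^{j\omega}) = \begin{cases} 1 & |\omega| < \dfrac{\Delta\omega}{2} \\[1ex] 0 & \dfrac{\Delta\omega}{2} \le |\omega| \le \pi \end{cases} \tag{5.3.33}$$
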
By using the convolution theorem, (5.3.32) can be written as

$$\hat{R}_x^{(PS)}(e^{j\omega}) = \sum_{l=-(L-1)}^{L-1} \hat{r}_x(l)\, w_a(l)\, e^{-j\omega l} \tag{5.3.34}$$

where $w_a(l)$ is the inverse Fourier transform of $W_a(e^{j\omega})$ and $L < N$. As we have already
mentioned, the window $w_a(l)$ is known as the correlation or lag window.† The correlation
window corresponding to (5.3.33) is

$$w_a(l) = \frac{\sin(l\,\Delta\omega/2)}{\pi l} \qquad -\infty < l < \infty \tag{5.3.35}$$

Since $w_a(l)$ has infinite duration, its truncation at $|l| = L < N$ creates ripples in $W_a(e^{j\omega})$
(Gibbs effect). To avoid this problem, we use correlation windows with finite duration, that
is, $w_a(l) = 0$ for $|l| \ge L$, $L < N$. For real sequences, where $\hat{r}_x(l)$ is real and even, $w_a(l)$ [and
hence $W_a(e^{j\omega})$] should be real and even. Given that $\hat{R}_x(e^{j\omega})$ is nonnegative, a sufficient
(but not necessary) condition that $\hat{R}_x^{(PS)}(e^{j\omega})$ be nonnegative is that $W_a(e^{j\omega}) \ge 0$ for all $\omega$.
This condition holds for the Bartlett (triangular) and Parzen (see Problem 5.11) windows,
but it does not hold for the Hamming, Hanning, or Kaiser window.

Thus, we note that smoothing the periodogram $\hat{R}_x(e^{j\omega})$ by convolving it with the
spectrum $W_a(e^{j\omega}) = \mathcal{F}\{w_a(l)\}$ is equivalent to windowing the autocorrelation estimate
$\hat{r}_x(l)$ with the correlation window $w_a(l)$. This approach to power spectrum estimation,
which was introduced by Blackman and Tukey (1959), involves the following steps:

1. Estimate the autocorrelation sequence from the unwindowed data.
2. Window the obtained autocorrelation samples.
3. Compute the DTFT of the windowed autocorrelation as given in (5.3.34).

†The term spectral window is quite often used for $W_a(e^{j\omega}) = \mathcal{F}\{w_a(l)\}$, the Fourier transform of the correlation
window. However, this term is misleading because $W_a(e^{j\omega})$ is essentially a frequency-domain impulse response.
We use the term correlation window for $w_a(l)$ and the term Fourier transform of the correlation window for
$W_a(e^{j\omega})$.


A pictorial comparison between the theoretical [i.e., using (5.3.32)] and the above practical
computation of the power spectrum using single-periodogram smoothing is shown in
Figure 5.17.

[Figure 5.17 (flow diagram): starting from the signal data record, $0 \le n \le N-1$, the THEORY branch
computes the periodogram $\hat{R}_x(e^{j\omega})$ and convolves it with $W_a(e^{j\omega})$; the PRACTICE branch computes
the autocorrelation $\hat{r}_x(l)$, windows it with $w_a(l)$, and takes the DFT using the FFT.]

FIGURE 5.17
Comparison of the theory and practice of the Blackman-Tukey
method.


The resolution of the Blackman-Tukey power spectrum estimator is determined by the
duration $2L - 1$ of the correlation window. For most correlation windows, the resolution
is measured by the 3-dB bandwidth of the mainlobe, which is on the order of $2\pi/L$ rad per
sampling interval.

The statistical quality of the Blackman-Tukey estimate $\hat{R}_x^{(PS)}(e^{j\omega})$ can be evaluated by
examining its mean, covariance, and variance.


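Mean of $\hat{R}_x^{(PS)}(e^{j\omega})$. The expected value of the smoothed periodogram $\hat{R}_x^{(PS)}(e^{j\omega})$
can be obtained by using (5.3.34) and (5.2.11). Indeed, we have

$$E\{\hat{R}_x^{(PS)}(e^{j\omega})\} = \sum_{l=-(L-1)}^{L-1} E\{\hat{r}_x(l)\}\, w_a(l)\, e^{-j\omega l} = \sum_{l=-(L-1)}^{L-1} r_x(l) \left(1 - \frac{|l|}{N}\right) w_a(l)\, e^{-j\omega l} \tag{5.3.36}$$

or, using the frequency convolution theorem, we have

$$E\{\hat{R}_x^{(PS)}(e^{j\omega})\} = R_x(e^{j\omega}) \circledast W_B(e^{j\omega}) \circledast W_a(e^{j\omega}) \tag{5.3.37}$$

where

$$W_B(e^{j\omega}) = \mathcal{F}\left\{\left(1 - \frac{|l|}{N}\right) w_R(l)\right\} = \frac{1}{N}\left[\frac{\sin(\omega N/2)}{\sin(\omega/2)}\right]^2 \tag{5.3.38}$$

is the Fourier transform of the Bartlett window, with $w_R(l)$ the rectangular window spanning $|l| \le N-1$.
Since $E\{\hat{R}_x^{(PS)}(e^{j\omega})\} \ne R_x(e^{j\omega})$, $\hat{R}_x^{(PS)}(e^{j\omega})$
is a biased estimate of $R_x(e^{j\omega})$.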
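For $L \ll N$, $(1 - |l|/N) \simeq 1$ and hence we obtain

$$E\{\hat{R}_x^{(PS)}(e^{j\omega})\} = \sum_{l=-(L-1)}^{L-1} r_x(l) \left(1 - \frac{|l|}{N}\right) w_a(l)\, e^{-j\omega l} \simeq R_x(e^{j\omega}) \circledast W_a(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R_x(e^{j\theta})\, W_a(e^{j(\omega-\theta)})\, d\theta \tag{5.3.39}$$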


If $L$ is sufficiently large, the spectrum $W_a(e^{j\omega})$ of the correlation window $w_a(l)$ has a narrow mainlobe. If
$R_x(e^{j\omega})$ can be assumed to be constant within the mainlobe, we have

$$E\{\hat{R}_x^{(PS)}(e^{j\omega})\} \simeq R_x(e^{j\omega})\, \frac{1}{2\pi} \int_{-\pi}^{\pi} W_a(e^{j(\omega-\theta)})\, d\theta$$

which implies that $\hat{R}_x^{(PS)}(e^{j\omega})$ is asymptotically unbiased if

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} W_a(e^{j\omega})\, d\omega = w_a(0) = 1 \tag{5.3.40}$$

that is, if the spectrum of the correlation window has unit area. Under this condition, if both
$L$ and $N$ tend to infinity, then $W_a(e^{j\omega})$ and $W_B(e^{j\omega})$ become periodic impulse trains and
the convolution (5.3.37) reproduces $R_x(e^{j\omega})$.


Covariance of $\hat{R}_x^{(PS)}(e^{j\omega})$. The following approximation

$$\operatorname{cov}\{\hat{R}_x^{(PS)}(e^{j\omega_1}), \hat{R}_x^{(PS)}(e^{j\omega_2})\} \simeq \frac{1}{2\pi N} \int_{-\pi}^{\pi} R_x^2(e^{j\theta})\, W_a(e^{j(\omega_1-\theta)})\, W_a(e^{j(\omega_2-\theta)})\, d\theta \tag{5.3.41}$$

derived in Jenkins and Watts (1968), holds under the assumptions that (1) $N$ is sufficiently
large that $W_B(e^{j\omega})$ behaves as a periodic impulse train and (2) $L$ is sufficiently large that
$W_a(e^{j\omega})$ is sufficiently narrow that the product $W_a(e^{j(\omega_1+\theta)})\, W_a(e^{j(\omega_2-\theta)})$ is negligible.
Hence, the covariance increases in proportion to the width of $W_a(e^{j\omega})$, that is, as the amount of
overlap between the windows $W_a(e^{j(\omega_1-\theta)})$ (centered at $\omega_1$) and $W_a(e^{j(\omega_2-\theta)})$ (centered
at $\omega_2$) increases.


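Variance of $\hat{R}_x^{(PS)}(e^{j\omega})$. When $\omega = \omega_1 = \omega_2$, (5.3.41) gives

$$\operatorname{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\} \simeq \frac{1}{2\pi N} \int_{-\pi}^{\pi} R_x^2(e^{j\theta})\, W_a^2(e^{j(\omega-\theta)})\, d\theta \tag{5.3.42}$$

If $R_x(e^{j\omega})$ is smooth within the width of $W_a(e^{j\omega})$, then

$$\operatorname{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\} \simeq R_x^2(e^{j\omega}) \left[\frac{1}{2\pi N} \int_{-\pi}^{\pi} W_a^2(e^{j\omega})\, d\omega\right] \tag{5.3.43}$$

or

$$\operatorname{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\} \simeq \frac{E_w}{N}\, R_x^2(e^{j\omega}) \qquad 0 < \omega < \pi \tag{5.3.44}$$

where

$$E_w = \frac{1}{2\pi} \int_{-\pi}^{\pi} W_a^2(e^{j\omega})\, d\omega = \sum_{l=-(L-1)}^{L-1} w_a^2(l) \tag{5.3.45}$$

is the energy of the correlation window. From (5.3.29) and (5.3.44) we have

$$\frac{\operatorname{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\}}{\operatorname{var}\{\hat{R}_x(e^{j\omega})\}} \simeq \frac{E_w}{N} \qquad 0 < \omega < \pi \tag{5.3.46}$$

which is known as the variance reduction factor or variance ratio and provides the reduction
in variance attained by smoothing the periodogram.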
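
To make the variance ratio (5.3.46) concrete, the sketch below evaluates $E_w/N$ for two correlation windows. It is illustrative only: the helper names are hypothetical, the rectangular window is taken as $w_a(l) = 1$ and the Bartlett window as $w_a(l) = 1 - |l|/L$ for $|l| < L$, and $N = 512$ is chosen to match the record length used in the examples of this section.

```python
def rectangular(lag, max_lag):
    """Rectangular correlation window: 1 for |l| < L, 0 otherwise."""
    return 1.0 if abs(lag) < max_lag else 0.0


def bartlett(lag, max_lag):
    """Bartlett (triangular) correlation window: 1 - |l|/L for |l| < L."""
    return 1.0 - abs(lag) / max_lag if abs(lag) < max_lag else 0.0


def window_energy(window, max_lag):
    """E_w = sum of w_a(l)^2 over l = -(L-1), ..., L-1, as in (5.3.45)."""
    return sum(window(lag, max_lag) ** 2
               for lag in range(-(max_lag - 1), max_lag))


def variance_ratio(window, max_lag, n_samples):
    """Variance reduction factor E_w / N of (5.3.46)."""
    return window_energy(window, max_lag) / n_samples


N = 512
for L in (64, 128, 256):
    print(L, variance_ratio(rectangular, L, N), variance_ratio(bartlett, L, N))
```

For a fixed $N$, the ratio grows with $L$ for both windows, and the tapered Bartlett window yields a smaller ratio than the rectangular window of the same length.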

In the beginning of this section, we explained the variance reduction in terms of
frequency-domain averaging. An alternative explanation can be provided by considering
the windowing of the estimated autocorrelation. As discussed in Section 5.2, the variance of
the autocorrelation estimate increases as $|l|$ approaches $N$ because fewer and fewer samples
are used to compute the estimate. Since every value of $\hat{r}_x(l)$ affects the value of $\hat{R}_x(e^{j\omega})$ at
all frequencies, the less reliable values affect the quality of the periodogram everywhere.
Thus, we can reduce the variance of the periodogram by minimizing the contribution of
autocorrelation terms with large variance, that is, with lags close to $N$, by proper windowing.

As we have already stressed, there is a tradeoff between resolution and variance. For
the variance to be small, we must choose a window that contains a small amount of energy
$E_w$. Since $|w_a(l)| \le 1$, we have $E_w \le 2L$. Thus, to reduce the variance, we must have
$L \ll N$. The bias of $\hat{R}_x^{(PS)}(e^{j\omega})$ is directly related to the resolution, which is determined by
the mainlobe width of the window, which in turn is proportional to $1/L$. Hence, to reduce
the bias, $W_a(e^{j\omega})$ should have a narrow mainlobe that demands a large $L$. The requirements
for high resolution (small bias) and low variance can be simultaneously satisfied only if $N$
is sufficiently large. The variance reduction for some commonly used windows is examined
in Problem 5.12. Empirical evidence suggests that use of the Parzen window is a reasonable
choice.


Confidence intervals. In the interpretation of spectral estimates, it is important to know
whether the spectral details are real or are due to statistical fluctuations. Such information
is provided by the confidence intervals (Chapter 3). When the spectrum is plotted on a logarithmic
scale, the $(1 - \alpha) \times 100$ percent confidence interval is constant at every frequency,
and it is given by (Koopmans 1974)

$$\left( 10\log \hat{R}_x^{(PS)}(e^{j\omega}) - 10\log\frac{\chi^2_\nu(1-\alpha/2)}{\nu},\ \ 10\log \hat{R}_x^{(PS)}(e^{j\omega}) + 10\log\frac{\nu}{\chi^2_\nu(\alpha/2)} \right) \tag{5.3.47}$$

where

$$\nu = \frac{2N}{\displaystyle\sum_{l=-(L-1)}^{L-1} w_a^2(l)} \tag{5.3.48}$$

is the degrees of freedom of a $\chi^2_\nu$ distribution.


Computation of $\hat{R}_x^{(PS)}(e^{j\omega})$ using the DFT. In practice, the Blackman-Tukey power
spectrum estimator is computed by using an $N$-point DFT as follows:

1. Estimate the autocorrelation $\hat{r}_x(l)$, using the formula

$$\hat{r}_x(l) = \hat{r}_x^*(-l) = \frac{1}{N} \sum_{n=0}^{N-l-1} x(n+l)\, x^*(n) \qquad l = 0, 1, \ldots, L-1 \tag{5.3.49}$$

For $L > 100$, indirect computation of $\hat{r}_x(l)$ by using DFT techniques is usually more
efficient (see Problem 5.13).

2. Form the sequence

$$f(l) = \begin{cases} \hat{r}_x(l)\, w_a(l) & 0 \le l \le L-1 \\ 0 & L \le l \le N-L \\ \hat{r}_x^*(N-l)\, w_a(N-l) & N-L+1 \le l \le N-1 \end{cases} \tag{5.3.50}$$

3. Compute the power spectrum estimate

$$\hat{R}_x^{(PS)}(e^{j\omega})\Big|_{\omega=(2\pi/N)k} = F(k) = \operatorname{DFT}\{f(l)\} \qquad 0 \le k \le N-1 \tag{5.3.51}$$

as the $N$-point DFT of the sequence $f(l)$.


MATLAB does not provide a direct function to implement the Blackman-Tukey method. 
However, such a function can be easily constructed by using built-in MATLAB functions 
and the above approach. The book toolbox function 


Rx = bt_psd(x,Nfft, window, L) ; 


implements the above algorithm in which window is any available MATLAB window and 
Nfft is chosen to be larger than N to obtain a high-density spectrum. 
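
The three computational steps above can be sketched in plain Python as follows. This is not the book toolbox function `bt_psd`; the names `biased_autocorrelation` and `blackman_tukey_psd` are hypothetical, the Bartlett lag window $w_a(l) = 1 - |l|/L$ is hard-coded as an assumption, and the DFT is evaluated directly instead of with an FFT to keep the example self-contained.

```python
import cmath
import math
import random


def biased_autocorrelation(x, num_lags):
    """Biased estimate (1/N) * sum_n x(n+l) x*(n) for l = 0..num_lags-1, as in (5.3.49)."""
    n_samples = len(x)
    return [sum(x[n + lag] * complex(x[n]).conjugate()
                for n in range(n_samples - lag)) / n_samples
            for lag in range(num_lags)]


def blackman_tukey_psd(x, num_lags, nfft):
    """Blackman-Tukey estimate at w_k = 2*pi*k/nfft using a Bartlett lag window."""
    if nfft < 2 * num_lags - 1:
        raise ValueError("nfft must be at least 2 * num_lags - 1")
    r = biased_autocorrelation(x, num_lags)
    window = [1.0 - lag / num_lags for lag in range(num_lags)]
    # Build f(l) of (5.3.50): positive lags first, negative lags wrapped to the end.
    f = [0j] * nfft
    for lag in range(num_lags):
        f[lag] = r[lag] * window[lag]
        if lag > 0:
            f[nfft - lag] = (r[lag] * window[lag]).conjugate()
    # Direct DFT of f(l), skipping the zero-padded entries.
    twiddle = [cmath.exp(-2j * math.pi * m / nfft) for m in range(nfft)]
    nonzero = [idx for idx in range(nfft) if f[idx] != 0]
    return [sum(f[idx] * twiddle[(k * idx) % nfft] for idx in nonzero).real
            for k in range(nfft)]


random.seed(1)  # fixed seed: illustrative data only
N = 256
x = [math.cos(0.4 * math.pi * n) + random.gauss(0.0, 0.5) for n in range(N)]
psd = blackman_tukey_psd(x, num_lags=32, nfft=256)
peak_bin = max(range(129), key=lambda k: psd[k])
print(peak_bin, 2.0 * peak_bin / 256)  # peak location in units of pi rad/sample
```

Because the Bartlett lag window has a nonnegative Fourier transform, the resulting estimate stays nonnegative up to rounding error; a window such as Hamming would not carry that guarantee.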


EXAMPLE 5.3.5 (BLACKMAN-TUKEY METHOD). Consider the spectrum estimation of three
sinusoids in white noise given in Example 5.3.4, that is,

$$x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + 0.25\cos(0.8\pi n + \phi_3) + v(n) \tag{5.3.52}$$

where $\phi_1$, $\phi_2$, and $\phi_3$ are jointly independent random variables uniformly distributed over
$[-\pi, \pi]$ and $v(n)$ is a unit-variance white noise. An ensemble of 50 realizations of $x(n)$ was
generated using $N = 512$. The autocorrelations of these realizations were estimated up to
lag $L = 64$, 128, and 256. These autocorrelations were windowed using the Bartlett window,
and then their 1024-point DFT was computed as the spectrum estimate. The results are shown in
Figure 5.18. The top row of the figure contains estimate overlays and the corresponding ensemble
average for $L = 64$, the middle row for $L = 128$, and the bottom row for $L = 256$. Several
observations can be made from these plots. First, the variance in the estimate has been considerably
reduced relative to the periodogram estimate. Second, the lower the lag distance $L$, the lower the
variance and the resolution (i.e., the higher the smoothing of the peaks). This observation is
consistent with our discussion above about the effect of $L$ on the quality of estimates. Finally, all
the frequencies including the one at $0.8\pi$ are clearly distinguishable, something that the basic
periodogram could not achieve.


5.3.3 Power Spectrum Estimation by Averaging Multiple Periodograms— 
The Welch-Bartlett Method 


As mentioned in Section 5.3.1, in general, the variance of the average of $K$ IID random variables
is $1/K$ times the variance of each of the random variables. Thus, to reduce the variance
of the periodogram, we could average the periodograms from $K$ different realizations of a
stationary random signal. However, in most practical applications, only a single realization
is available. In this case, we can subdivide the existing record $\{x(n), 0 \le n \le N-1\}$ into
$K$ (possibly overlapping) smaller segments as follows:

$$x_i(n) = x(iD + n)\, w(n) \qquad 0 \le n \le L-1,\ 0 \le i \le K-1 \tag{5.3.53}$$

where $w(n)$ is a window of duration $L$ and $D$ is an offset distance. If $D < L$, the segments
overlap; and for $D = L$, the segments are contiguous. The periodogram of the $i$th segment
is

$$\hat{R}_{x,i}(e^{j\omega}) \triangleq \frac{1}{L} \left|X_i(e^{j\omega})\right|^2 = \frac{1}{L} \left|\sum_{n=0}^{L-1} x_i(n)\, e^{-j\omega n}\right|^2 \tag{5.3.54}$$

We remind the reader that the window $w(n)$ in (5.3.53) is called a data window because
it is applied directly to the data, in contrast to a correlation window that is applied to the
autocorrelation sequence [see (5.3.34)]. Notice that there is no need for the data window
to have an even shape or for its Fourier transform to be nonnegative. The purpose of using
the data window is to control spectral leakage.


The spectrum estimate $\hat{R}_x^{(PA)}(e^{j\omega})$ is obtained by averaging $K$ periodograms as follows:

$$\hat{R}_x^{(PA)}(e^{j\omega}) \triangleq \frac{1}{K} \sum_{i=0}^{K-1} \hat{R}_{x,i}(e^{j\omega}) = \frac{1}{KL} \sum_{i=0}^{K-1} \left|X_i(e^{j\omega})\right|^2 \tag{5.3.55}$$

where the superscript (PA) denotes periodogram averaging. To determine the bias and
variance of $\hat{R}_x^{(PA)}(e^{j\omega})$, we let $D = L$ so that the segments do not overlap. The so-computed
estimate $\hat{R}_x^{(PA)}(e^{j\omega})$ is known as the Bartlett estimate. We also assume that $r_x(l)$ is very
small for $|l| > L$. This implies that the signal segments can be assumed to be approximately
uncorrelated. To show that the simple periodogram averaging in Bartlett’s method reduces
the periodogram variance, we consider the following example.


[Figure 5.18 panels: B-T spectrum estimate overlays and B-T spectrum estimate averages for $L = 64$, 128, and 256.]

FIGURE 5.18
Spectrum estimation of three sinusoids in white noise using the Blackman-Tukey method in
Example 5.3.5.


EXAMPLE 5.3.6 (PERIODOGRAM AVERAGING). Let $x(n)$ be a stationary white Gaussian
noise with zero mean and unit variance. The theoretical spectrum of $x(n)$ is

$$R_x(e^{j\omega}) = \sigma_x^2 = 1 \qquad -\pi < \omega \le \pi$$

An ensemble of 50 different 512-point records of $x(n)$ was generated using a pseudorandom
number generator. The Bartlett estimate of each record was computed for $K = 1$ (i.e., the basic
periodogram), $K = 4$ (or $L = 128$), and $K = 8$ (or $L = 64$). The results in the form of estimate
overlays and averages are shown in Figure 5.19. The effect of periodogram averaging is clearly
evident.

FIGURE 5.19
Spectral estimation of white noise using Bartlett’s method in Example 5.3.6.


Mean of $\hat{R}_x^{(PA)}(e^{j\omega})$. The mean value of $\hat{R}_x^{(PA)}(e^{j\omega})$ is

$$E\{\hat{R}_x^{(PA)}(e^{j\omega})\} = \frac{1}{K} \sum_{i=0}^{K-1} E\{\hat{R}_{x,i}(e^{j\omega})\} = E\{\hat{R}_x(e^{j\omega})\} \tag{5.3.56}$$

where we have assumed that $E\{\hat{R}_{x,i}(e^{j\omega})\} = E\{\hat{R}_x(e^{j\omega})\}$ because of the stationarity
assumption. From (5.3.56) and (5.3.15), we have

$$E\{\hat{R}_x^{(PA)}(e^{j\omega})\} = E\{\hat{R}_x(e^{j\omega})\} = \frac{1}{2\pi L} \int_{-\pi}^{\pi} R_x(e^{j\theta})\, R_w(e^{j(\omega-\theta)})\, d\theta \tag{5.3.57}$$

where $R_w(e^{j\omega})$ is the spectrum of the data window $w(n)$. Hence, $\hat{R}_x^{(PA)}(e^{j\omega})$ is a biased
estimate of $R_x(e^{j\omega})$. However, if the data window is normalized such that

$$\sum_{n=0}^{L-1} w^2(n) = L \tag{5.3.58}$$

the estimate $\hat{R}_x^{(PA)}(e^{j\omega})$ becomes asymptotically unbiased [see the discussion following
equation (5.3.15)].


Variance of $\hat{R}_x^{(PA)}(e^{j\omega})$. The variance of $\hat{R}_x^{(PA)}(e^{j\omega})$ is

$$\operatorname{var}\{\hat{R}_x^{(PA)}(e^{j\omega})\} = \frac{1}{K} \operatorname{var}\{\hat{R}_x(e^{j\omega})\} \tag{5.3.59}$$

(assuming segments are independent) or using (5.3.29) gives

$$\operatorname{var}\{\hat{R}_x^{(PA)}(e^{j\omega})\} \simeq \frac{1}{K} R_x^2(e^{j\omega}) \tag{5.3.60}$$

Clearly, as $K$ increases, the variance tends to zero. Thus, $\hat{R}_x^{(PA)}(e^{j\omega})$ provides an asymptotically
unbiased and consistent estimate of $R_x(e^{j\omega})$. If $N$ is fixed and $N = KL$, we see that
increasing $K$ to reduce the variance (or equivalently obtain a smoother estimate) results in
a decrease in $L$, that is, a reduction in resolution (or equivalently an increase in bias).
When $w(n)$ in (5.3.53) is the rectangular window of duration $L$, the squared magnitude of its Fourier
transform is equal to the Fourier transform of the triangular sequence $w_T(l) = L - |l|$, $|l| < L$,
which, when combined with the $1/L$ factor in (5.3.57), results in the Bartlett window

$$w_B(l) = \begin{cases} 1 - \dfrac{|l|}{L} & |l| < L \\[1ex] 0 & \text{elsewhere} \end{cases} \tag{5.3.61}$$

with

$$W_B(e^{j\omega}) = \frac{1}{L} \left[\frac{\sin(\omega L/2)}{\sin(\omega/2)}\right]^2 \tag{5.3.62}$$

This special case of averaging multiple nonoverlapping periodograms was introduced by
Bartlett (1953).

The method has been extended to modified overlapping periodograms by Welch (1970), 
who has shown that the shape of the window does not affect the variance formula (5.3.59). 
Welch showed that overlapping the segments by 50 percent reduces the variance by about 
a factor of 2, owing to doubling the number of segments. More overlap does not result 
in additional reduction of variance because the data segments become less and less inde- 
pendent. Clearly, the nonoverlapping segments can be uncorrelated only for white noise 
signals. However, the data segments can be considered approximately uncorrelated if they 
do not have sharp spectral peaks or if their autocorrelations decay fast. 

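Thus, the variance reduction factor for the spectral estimator $\hat{R}_x^{(PA)}(e^{j\omega})$ is

$$\frac{\operatorname{var}\{\hat{R}_x^{(PA)}(e^{j\omega})\}}{\operatorname{var}\{\hat{R}_x(e^{j\omega})\}} \simeq \frac{1}{K} \qquad 0 < \omega < \pi \tag{5.3.63}$$

and is reduced by a factor of 2 for 50 percent overlap.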


Confidence intervals. The $(1 - \alpha) \times 100$ percent confidence interval on a logarithmic
scale may be shown to be (Jenkins and Watts 1968)

$$\left( 10\log \hat{R}_x^{(PA)}(e^{j\omega}) - 10\log\frac{\chi^2_{2K}(1-\alpha/2)}{2K},\ \ 10\log \hat{R}_x^{(PA)}(e^{j\omega}) + 10\log\frac{2K}{\chi^2_{2K}(\alpha/2)} \right) \tag{5.3.64}$$

where $\chi^2_{2K}$ is a chi-squared distribution with $2K$ degrees of freedom.


Computation of $\hat{R}_x^{(PA)}(e^{j\omega})$ using the DFT. In practice, to compute $\hat{R}_x^{(PA)}(e^{j\omega})$ at $L$
equally spaced frequencies $\omega_k = 2\pi k/L$, $0 \le k \le L-1$, the method of periodogram
averaging can be easily and efficiently implemented by using the DFT as follows (we have
assumed that $L$ is even):

1. Segment data $\{x(n)\}_0^{N-1}$ into $K$ segments of length $L$, each offset by $D$ duration using

$$x_i(n) = x(iD + n) \qquad 0 \le i \le K-1,\ 0 \le n \le L-1 \tag{5.3.65}$$

If $D = L$, there is no overlap; and if $D = L/2$, the overlap is 50 percent.

2. Window each segment, using data window $w(n)$

$$\bar{x}_i(n) = x_i(n)\, w(n) = x(iD + n)\, w(n) \qquad 0 \le i \le K-1,\ 0 \le n \le L-1 \tag{5.3.66}$$

3. Compute the $L$-point DFTs $X_i(k)$ of the windowed segments $\bar{x}_i(n)$, $0 \le i \le K-1$,

$$X_i(k) = \sum_{n=0}^{L-1} \bar{x}_i(n)\, e^{-j(2\pi/L)kn} \qquad 0 \le k \le L-1,\ 0 \le i \le K-1 \tag{5.3.67}$$

4. Accumulate the squares $|X_i(k)|^2$

$$S_i(k) \triangleq |X_i(k)|^2 \qquad 0 \le k \le L/2 \tag{5.3.68}$$

5. Finally, normalize by $KL$ to obtain the estimate $\hat{R}_x^{(PA)}(k)$:

$$\hat{R}_x^{(PA)}(k) = \frac{1}{KL} \sum_{i=0}^{K-1} S_i(k) \qquad 0 \le k \le L/2 \tag{5.3.69}$$

At this point we emphasize that the spectrum estimate $\hat{R}_x^{(PA)}(k)$ is always nonnegative.
A pictorial description of this computational algorithm is shown in Figure 5.20. A more
efficient way to compute $\hat{R}_x^{(PA)}(k)$ is examined in Problem 5.14.


[Figure 5.20 (flow diagram): the signal data record is divided into segments 1, 2, ..., K, each displaced
by the offset; the periodogram of each segment is computed via the FFT, and the K periodograms are
averaged to give the PSD estimate.]

FIGURE 5.20
Pictorial description of the Welch-Bartlett method.

In MATLAB the Welch-Bartlett method is implemented by using the function
Rx = psd(x,Nfft,Fs,window(L),Noverlap,'none');


where window is the name of any MATLAB-provided window function (e.g., hamming); Nfft 
is the size of the DFT, which is chosen to be larger than L to obtain a high-density spectrum; 
Fs is the sampling frequency, which is used for plotting purposes; and Noverlap specifies 
the number of overlapping samples. If the boxcar window is used along with Noverlap=0, 
then we obtain Bartlett’s method of periodogram averaging. (Note that Noverlap is different 
from the offset parameter D given above.) If Noverlap=L/2 is used, then we obtain Welch’s 
averaged periodogram method with 50 percent overlap. 
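
For readers without access to MATLAB, the five computational steps of the Welch-Bartlett method can be sketched in plain Python as below. This is a minimal illustration and not a port of the `psd` function: the names `hamming` and `welch_psd` are hypothetical, the DFT length is fixed to the segment length $L$ (no zero padding), the window is rescaled internally to satisfy (5.3.58), and the DFT is evaluated directly rather than with an FFT.

```python
import cmath
import math
import random


def hamming(length):
    """Hamming data window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def welch_psd(x, seg_len, offset, window):
    """Average the modified periodograms of (possibly overlapping) segments.

    Returns (psd, num_segments), with psd[k] the estimate at w_k = 2*pi*k/seg_len.
    """
    num_segments = (len(x) - seg_len) // offset + 1
    # Rescale the window so that sum(w^2) = L, the normalization in (5.3.58).
    scale = math.sqrt(seg_len / sum(w * w for w in window))
    win = [scale * w for w in window]
    twiddle = [cmath.exp(-2j * math.pi * m / seg_len) for m in range(seg_len)]
    psd = [0.0] * seg_len
    for i in range(num_segments):
        seg = [x[i * offset + n] * win[n] for n in range(seg_len)]
        for k in range(seg_len):
            dft_k = sum(seg[n] * twiddle[(k * n) % seg_len]
                        for n in range(seg_len))
            psd[k] += abs(dft_k) ** 2  # accumulate |X_i(k)|^2
    return [p / (num_segments * seg_len) for p in psd], num_segments


random.seed(2)  # fixed seed: illustrative data only
x = [random.gauss(0.0, 1.0) for _ in range(512)]  # unit-variance white noise
L = 128
welch, k_welch = welch_psd(x, L, L // 2, hamming(L))   # 50 percent overlap
bartlett, k_bartlett = welch_psd(x, L, L, [1.0] * L)   # rectangular window, no overlap
print(k_welch, k_bartlett)
```

Note that the `offset` argument here plays the role of $D$ in (5.3.65), so 50 percent overlap corresponds to `offset = L // 2`, whereas the MATLAB argument `Noverlap` counts overlapping samples.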

A biased estimate $\hat{r}_x(l)$, $|l| < L$, of the autocorrelation sequence of $x(n)$ can be obtained
by taking the inverse $N$-point DFT of $\hat{R}_x^{(PA)}(k)$ if $N \ge 2L - 1$. Since only samples
of the continuous spectrum $\hat{R}_x^{(PA)}(e^{j\omega})$ are available, the obtained autocorrelation sequence
$\hat{r}_x^{(PA)}(l)$ is an aliased version of the true autocorrelation $r_x(l)$ of the signal $x(n)$ (see Problem 5.15).


EXAMPLE 5.3.7 (BARTLETT’S METHOD). Consider again the spectrum estimation of three
sinusoids in white noise given in Example 5.3.4, that is,

$$x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + 0.25\cos(0.8\pi n + \phi_3) + v(n) \tag{5.3.70}$$

where $\phi_1$, $\phi_2$, and $\phi_3$ are jointly independent random variables uniformly distributed over
$[-\pi, \pi]$ and $v(n)$ is a unit-variance white noise. An ensemble of 50 realizations of $x(n)$ was
generated using $N = 512$. The Bartlett estimate of each realization was computed for $K = 1$
(i.e., the basic periodogram), $K = 4$ (or $L = 128$), and $K = 8$ (or $L = 64$). The results in the
form of estimate overlays and averages are shown in Figure 5.21. Observe that the variance in
the estimate decreases steadily, relative to the periodogram estimate, as the number of averaging
segments increases. However, this reduction comes at the price of broadening of the
spectral peaks. Since no window is used, the sidelobes are very prominent even for the $K = 8$
case. Thus confidence in the $\omega = 0.8\pi$ spectral line is not very high for the $K = 8$ case.


EXAMPLE 5.3.8 (WELCH’S METHOD). Consider Welch’s method for the random process in 
the above example for N = 512, 50 percent overlap, and a Hamming window. Three different 
values for L were considered; L = 256 (3 segments), L = 128 (7 segments), and L = 64 (15 
segments). The estimate overlays and averages are shown in Figure 5.22. In comparing these 
results with those in Figure 5.21, note that the windowing has considerably reduced the spurious 
peaks in the spectra but has also further smoothed the peaks. Thus the peak at $0.8\pi$ is recognizable
with high confidence, but the separation of two close peaks is not so clear for L = 64. However, 
the L = 128 case provides the best balance between separation and detection. On comparing the 
Blackman-Tukey (Figure 5.18) and Welch estimates, we observe that the results are comparable 
in terms of variance reduction and smoothing aspects. 
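
The segment counts quoted in Example 5.3.8 follow from the record length, the segment length, and the offset. A small sketch, assuming that only complete segments are used and that 50 percent overlap means an offset of $D = L/2$ (the function name `num_segments` is hypothetical):

```python
def num_segments(record_len, seg_len, offset):
    """Number of complete length-seg_len segments that fit in the record
    when consecutive segments start `offset` samples apart."""
    return (record_len - seg_len) // offset + 1


for L in (256, 128, 64):
    print(L, num_segments(512, L, L // 2))
```

With $N = 512$ this reproduces the 3, 7, and 15 segments stated in the example.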


5.3.4 Some Practical Considerations and Examples 


The periodogram and its modified version, which is the basic tool involved in the estimation
of the power spectrum of stationary signals, can be computed either directly from the signal
samples $\{x(n)\}_0^{N-1}$ using the DTFT formula

$$\hat{R}_x(e^{j\omega}) = \frac{1}{N} \left|\sum_{n=0}^{N-1} w(n)\, x(n)\, e^{-j\omega n}\right|^2 \tag{5.3.71}$$

or indirectly using the autocorrelation sequence

$$\hat{R}_x(e^{j\omega}) = \sum_{l=-(N-1)}^{N-1} \hat{r}_x(l)\, e^{-j\omega l} \tag{5.3.72}$$

where $\hat{r}_x(l)$ is the estimated autocorrelation of the windowed segment $\{w(n)x(n)\}_0^{N-1}$. The
periodogram $\hat{R}_x(e^{j\omega})$ provides an unacceptable estimate of the power spectrum because

1. it has a bias that depends on the length $N$ and the shape of the data window $w(n)$ and
2. its variance is on the order of the square of the true spectrum, $R_x^2(e^{j\omega})$, regardless of $N$.


Given a data segment of fixed duration $N$, there is no way to reduce the bias, or
equivalently to increase the resolution, because it depends on the length and the shape of the
window. However, we can reduce the variance either by smoothing the single periodogram
of the data (method of Blackman-Tukey) or by averaging multiple periodograms obtained
by partitioning the available record into smaller overlapping segments (method of
Bartlett-Welch).

The method of Blackman-Tukey is based on the following modification of the indirect
periodogram formula

$$\hat{R}_x^{(PS)}(e^{j\omega}) = \sum_{l=-(L-1)}^{L-1} \hat{r}_x(l)\, w_a(l)\, e^{-j\omega l} \tag{5.3.73}$$

which basically involves windowing of the estimated autocorrelation (5.2.1) with a proper
correlation window. Using only the first $L \ll N$ more-reliable values of the autocorrelation
sequence reduces the variance of the spectrum estimate by a factor of approximately $L/N$.
However, at the same time, this reduces the resolution from about $1/N$ to about $1/L$. The
recommended range for $L$ is between $0.1N$ and $0.2N$.

FIGURE 5.21
Estimation of three sinusoids in white noise using Bartlett’s method in Example 5.3.7.

The method of Bartlett-Welch is based on partitioning the available data record into
windowed overlapping segments of length $L$, computing their periodograms by using the
direct formula (5.3.71), and then averaging the resulting periodograms to compute the
estimate

$$\hat{R}_x^{(PA)}(e^{j\omega}) = \frac{1}{KL} \sum_{i=0}^{K-1} \left|\sum_{n=0}^{L-1} x_i(n)\, e^{-j\omega n}\right|^2 \tag{5.3.74}$$

where $x_i(n)$ are the windowed segments of (5.3.53). The resolution of this estimate is reduced to
approximately $1/L$, and its variance is reduced by a factor of about $K$, where $K$ is the number of segments.

The reduction in resolution and variance of the Blackman-Tukey estimate is achieved
by “averaging” the values of the spectrum at consecutive frequency bins by windowing
the estimated autocorrelation sequence. In the Bartlett-Welch method, the same effect is
achieved by averaging the values of multiple shorter periodograms at the same frequency
bin. The PSD estimation methods and their properties are summarized in Table 5.3. The
multitaper spectrum estimation method given in the last column of Table 5.3 is discussed
in Section 5.5.

FIGURE 5.22
Estimation of three sinusoids in white noise using Welch’s method in Example 5.3.8.


TABLE 5.3
Comparison of PSD estimation methods.

| | Periodogram: $\hat{R}_x(e^{j\omega})$ | Single-periodogram smoothing (Blackman-Tukey): $\hat{R}_x^{(PS)}(e^{j\omega})$ | Multiple-periodogram averaging (Bartlett-Welch): $\hat{R}_x^{(PA)}(e^{j\omega})$ | Multitaper (Thomson): $\hat{R}_x^{(MT)}(e^{j\omega})$ |
|---|---|---|---|---|
| Description of the method | Compute DFT of data record | Compute DFT of windowed autocorrelation estimate (see Figure 5.17) | Split record into $K$ segments and average their modified periodograms (see Figure 5.20) | Window data record using $K$ orthonormal tapers and average their periodograms (see Figure 5.30) |
| Basic idea | Natural estimator of $R_x(e^{j\omega})$; the error $\lvert r_x(l) - \hat{r}_x(l)\rvert$ is large for large $\lvert l\rvert$ | Local smoothing of $\hat{R}_x(e^{j\omega})$ by weighting $\hat{r}_x(l)$ with a lag window $w_a(l)$ | Overlap data records to create more segments; window segments to reduce bias; average periodograms to reduce variance | For properly designed orthogonal tapers, periodograms are independent at each frequency. Hence averaging reduces variance |
| Bias | Severe for small $N$; negligible for large $N$ | Asymptotically unbiased | Asymptotically unbiased | Negligible for properly designed tapers |
| Resolution | $\propto 1/N$ | $\propto 1/L$; $L$ is maximum lag | $\propto 1/L$; $L$ is segment length | $\propto 1/N$ |
| Variance | Unacceptable: about $R_x^2(e^{j\omega})$ for all $N$ | $R_x^2(e^{j\omega}) \times E_w/N$ | $R_x^2(e^{j\omega})/K$; $K$ is number of segments | $R_x^2(e^{j\omega})/K$; $K$ is number of tapers |


EXAMPLE 5.3.9 (COMPARISON OF BLACKMAN-TUKEY AND WELCH-BARTLETT METHODS).
Figure 5.23 illustrates the properties of the power spectrum estimators based on autocorrelation
windowing and periodogram averaging using the AR(4) model (5.3.24). The top plots show the
power spectrum of the process. The left column plots show the power spectrum obtained by
windowing the data with a Hanning window and the autocorrelation with a Parzen window of
length $L = 64$, 128, and 256. We notice that as the length of the window increases, the resolution
improves (the spectral peaks become narrower) and the variance increases. We see a similar behavior
with the method of averaged periodograms as the segment length $L$ increases from 64 to 256.
Clearly, both methods give comparable results if their parameters are chosen properly.


Example of ocean wave data. To apply spectrum estimation techniques discussed in
this chapter to real data, we will use two real-valued time series that are obtained by recording
the height of ocean waves as a function of time, as measured by two wave gages of different
designs. These two series are shown in Figure 5.24. The top graph shows the wire wave gage
data while the bottom graph shows the infrared wave gage data. The frequency responses
of these gages are such that, mainly because of their inertia, frequencies higher than 1 Hz
cannot be reliably measured. The frequency range between 0.2 and 1 Hz is also important
because the rate at which the spectrum decreases has a physical model associated with it.
Both series were collected at a rate of 30 samples per second. There are 4096 samples in
each series.† We will also use these data to study joint signal analysis in the next section.

†These data were collected by A. Jessup, Applied Physics Laboratory, University of Washington. They were obtained
from StatLib, a statistical archive maintained by Carnegie Mellon University.

FIGURE 5.23
Illustration of the properties of the power spectrum estimators using autocorrelation
windowing (left column) and periodogram averaging (right column) in Example 5.3.9.


EXAMPLE 5.3.10 (ANALYSIS OF THE OCEAN WAVE DATA). Figure 5.25 depicts the periodogram
averaging and smoothing estimates of the wire wave gage data. The top row of plots
shows the Welch estimate using a Hamming window, $L = 256$, and 50 percent overlap between
segments. The bottom row shows the Blackman-Tukey estimate using a Bartlett window and a
lag length of $L = 256$. In both cases, a zoomed view of the plots between 0 and 1 Hz is shown in
the right column to obtain a better view of the spectra. Both spectral estimates provide a similar
spectral behavior, especially over the frequency range of 0 to 1 Hz. Furthermore, both show a
broad, low-frequency peak at 0.13 Hz, corresponding to a period of about 8 s. The dominant
features of the time series thus can be attributed to this peak and other features in the 0- to 0.2-Hz
range. The shape of the spectrum between 0.2 and 1 Hz is a decaying exponential and is consistent
with the physical model. Similar results were obtained for the infrared wave gage data.

FIGURE 5.24
Display of ocean wave data.

FIGURE 5.25
Spectrum estimation of the ocean wave data using the Welch and Blackman-Tukey methods.


5.4 JOINT SIGNAL ANALYSIS 


Until now, we discussed estimation techniques for the computation of the power spectrum
of one random process $x(n)$, which is also known as univariate spectral estimation. In
many practical applications, we have two jointly stationary random processes and we wish
to study the correlation between them. The analysis and computation of this correlation
and the associated spectral quantities are similar to those of univariate estimation and are
called bivariate spectrum estimation. In this section, we provide a brief overview of this
joint signal analysis.

Let x(n) and y(n) be two zero-mean, jointly stationary random processes with power
spectra $R_x(e^{j\omega})$ and $R_y(e^{j\omega})$, respectively. Then from (3.3.61), the cross-power spectral
density of x(n) and y(n) is given by

$$R_{xy}(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_{xy}(l)\, e^{-j\omega l} \tag{5.4.1}$$


where $r_{xy}(l)$ is the cross-correlation sequence between x(n) and y(n). The cross-spectral
density $R_{xy}(e^{j\omega})$ is, in general, a complex-valued function that is difficult to interpret or
plot in its complex form. Therefore, we need to express it by using real-valued functions
that are easier to deal with. It is customary to express the conjugate of $R_{xy}(e^{j\omega})$ in terms
of its real and imaginary components, that is,

$$R_{xy}(e^{j\omega}) = C_{xy}(\omega) - j\,Q_{xy}(\omega) \tag{5.4.2}$$

where

$$C_{xy}(\omega) \triangleq \operatorname{Re}[R_{xy}(e^{j\omega})] \tag{5.4.3}$$

is called the cospectrum and

$$Q_{xy}(\omega) \triangleq \operatorname{Im}[R^{*}_{xy}(e^{j\omega})] = -\operatorname{Im}[R_{xy}(e^{j\omega})] \tag{5.4.4}$$


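is called the quadrature spectrum. Alternatively, the most popular approach is to express
$R_{xy}(e^{j\omega})$ in terms of its magnitude and angle components, that is,

$$R_{xy}(e^{j\omega}) = A_{xy}(\omega) \exp[j\Phi_{xy}(\omega)] \tag{5.4.5}$$

where

$$A_{xy}(\omega) = |R_{xy}(e^{j\omega})| = \sqrt{C^2_{xy}(\omega) + Q^2_{xy}(\omega)} \tag{5.4.6}$$

and

$$\Phi_{xy}(\omega) = \angle R_{xy}(e^{j\omega}) = \tan^{-1}\{-Q_{xy}(\omega)/C_{xy}(\omega)\} \tag{5.4.7}$$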


The magnitude $A_{xy}(\omega)$ is called the cross-amplitude spectrum, and the angle $\Phi_{xy}(\omega)$ is
called the phase spectrum. All these derived functions are real-valued and hence can be
examined graphically. However, the phase spectrum has a $2\pi$ ambiguity in its computation,
which makes its interpretation somewhat problematic.
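
The definitions (5.4.2) to (5.4.7) translate directly into a few array operations. The following is a minimal NumPy sketch, not part of the original text: the function name `cross_spectrum_summaries` and the sample values are illustrative assumptions. It uses the four-quadrant arctangent so that the phase is returned in $(-\pi, \pi]$, which is one common way of handling the $2\pi$ ambiguity.

```python
import numpy as np

def cross_spectrum_summaries(Rxy):
    """Split complex cross-spectrum samples into real-valued summaries.

    Hypothetical helper: returns cospectrum C, quadrature spectrum Q,
    cross-amplitude spectrum A, and phase spectrum Phi.
    """
    Rxy = np.asarray(Rxy, dtype=complex)
    C = Rxy.real                  # cospectrum, (5.4.3)
    Q = -Rxy.imag                 # quadrature spectrum, (5.4.4)
    A = np.hypot(C, Q)            # cross-amplitude spectrum, (5.4.6)
    Phi = np.arctan2(-Q, C)       # phase spectrum, (5.4.7), four-quadrant
    return C, Q, A, Phi

# Illustrative cross-spectrum samples (made-up values)
Rxy = np.array([1 + 1j, -2j, -3.0])
C, Q, A, Phi = cross_spectrum_summaries(Rxy)
```

Both representations carry the same information: `C - 1j*Q` and `A*np.exp(1j*Phi)` each reproduce `Rxy`.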

From (3.3.64) the normalized cross-spectrum, called the complex coherence function,
is given by

$$\mathcal{G}_{xy}(\omega) = \frac{R_{xy}(e^{j\omega})}{\sqrt{R_x(e^{j\omega})\,R_y(e^{j\omega})}} \tag{5.4.8}$$

which is a complex-valued frequency-domain correlation coefficient that measures the
correlation between the random amplitudes of the complex exponentials with frequency
$\omega$ in the spectral representations of x(n) and y(n). Hence to interpret this coefficient, its
magnitude $|\mathcal{G}_{xy}(\omega)|$ is computed, which is referred to as the coherency spectrum. Recall
that in Chapter 3, we called $|\mathcal{G}_{xy}(\omega)|^2$ the magnitude-squared coherence (MSC). Clearly,
$0 \le |\mathcal{G}_{xy}(\omega)| \le 1$. Since the coherency spectrum captures the amplitude spectrum but
completely ignores the phase spectrum, in practice, the coherency and the phase spectrum
are useful real-valued summaries of the cross-spectrum.


5.4.1 Estimation of Cross Power Spectrum 


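Now we apply the techniques developed in Section 5.3 to the problem of estimating the
cross-spectrum and its associated real-valued functions. Let $\{x(n), y(n)\}_0^{N-1}$ be the data
record available for estimation. By using the periodogram (5.3.5) as a guide, the estimator
for $R_{xy}(e^{j\omega})$ is the cross-periodogram given by

$$\hat{R}_{xy}(e^{j\omega}) = \sum_{l=-(N-1)}^{N-1} \hat{r}_{xy}(l)\, e^{-j\omega l} \tag{5.4.9}$$

where

$$\hat{r}_{xy}(l) = \begin{cases} \dfrac{1}{N} \displaystyle\sum_{n=0}^{N-l-1} x(n+l)\,y^{*}(n) & 0 \le l \le N-1 \\[2ex] \dfrac{1}{N} \displaystyle\sum_{n=0}^{N+l-1} x(n)\,y^{*}(n-l) & -(N-1) \le l \le -1 \\[2ex] 0 & l \le -N \text{ or } l \ge N \end{cases} \tag{5.4.10}$$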


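In analogy to (5.3.2), the cross-periodogram can also be written as

$$\hat{R}_{xy}(e^{j\omega}) = \frac{1}{N} \left[\sum_{n=0}^{N-1} x(n)\, e^{-j\omega n}\right] \left[\sum_{n=0}^{N-1} y(n)\, e^{-j\omega n}\right]^{*} \tag{5.4.11}$$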


Once again, it can be shown that the bias and variance properties of the cross-periodogram 
are as poor as those of the periodogram. Another disturbing result of these periodograms is 
that from (5.4.11) and (5.3.2), we obtain 


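$$|\hat{R}_{xy}(e^{j\omega})|^2 = \left(\frac{1}{N}\right)^{2} \left|\sum_{n=0}^{N-1} x(n)\, e^{-j\omega n}\right|^{2} \left|\sum_{n=0}^{N-1} y(n)\, e^{-j\omega n}\right|^{2} = \hat{R}_x(e^{j\omega})\,\hat{R}_y(e^{j\omega})$$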


which implies that if we estimate the MSC from the “raw” autoperiodograms as well as 
cross-periodograms, then the result is always unity for all frequencies. This seemingly un- 
reasonable result is due to the fact that the frequency-domain correlation coefficient at each 
frequency w is estimated by using only one single pair of observations from the two signals. 
Therefore, a reasonable amount of smoothing in the periodogram is necessary to reduce the 
inherent variability of the cross-spectrum and to improve the accuracy of the estimated co- 
herency. This variance reduction can be achieved by straightforward extensions of various 
techniques discussed in Section 5.3 for power spectra. These methods include periodogram 
smoothing across frequencies and the various modified periodogram averaging techniques. 
In practice, Welch’s approach to modified periodogram averaging, based on overlapped seg- 
ments, is preferred owing to its superior performance. For illustration purposes, we describe 
Welch’s approach in a brief fashion. 
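
The unity result for the "raw" estimates is easy to confirm numerically. The following NumPy sketch (the signal lengths, seed, and variable names are illustrative assumptions, not from the text) forms the periodograms and the cross-periodogram of two independent white noise records and evaluates the MSC estimate without any averaging.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
x = rng.standard_normal(N)
y = rng.standard_normal(N)        # generated independently of x

X = np.fft.fft(x)
Y = np.fft.fft(y)
Rx = np.abs(X) ** 2 / N           # periodogram of x
Ry = np.abs(Y) ** 2 / N           # periodogram of y
Rxy = X * np.conj(Y) / N          # cross-periodogram, as in (5.4.11)

# MSC formed from raw (unaveraged) estimates
msc_raw = np.abs(Rxy) ** 2 / (Rx * Ry)
```

Even though x and y are unrelated, `msc_raw` equals one (to rounding error) at every DFT frequency, which is why averaging or smoothing is needed before the coherency is interpreted.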

In this approach, we subdivide the existing data records $\{x(n), y(n);\ 0 \le n \le N-1\}$
into K overlapping smaller segments of length L as follows:

$$\begin{aligned} x_i(n) &= x(iD+n)\,w(n) \\ y_i(n) &= y(iD+n)\,w(n) \end{aligned} \qquad 0 \le n \le L-1,\quad 0 \le i \le K-1 \tag{5.4.12}$$


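where w(n) is a data window of length L and D = L/2 for 50 percent overlap. The
cross-periodogram of the ith segment is given by

$$\hat{R}_{xy,i}(e^{j\omega}) = \frac{1}{L} X_i(e^{j\omega})\, Y_i^{*}(e^{j\omega}) = \frac{1}{L} \left[\sum_{n=0}^{L-1} x_i(n)\, e^{-j\omega n}\right] \left[\sum_{n=0}^{L-1} y_i(n)\, e^{-j\omega n}\right]^{*} \tag{5.4.13}$$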


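Finally, the smoothed cross-spectrum $\hat{R}^{(PA)}_{xy}(e^{j\omega})$ is obtained by averaging K cross-periodograms
as follows:

$$\hat{R}^{(PA)}_{xy}(e^{j\omega}) = \frac{1}{K} \sum_{i=0}^{K-1} \hat{R}_{xy,i}(e^{j\omega}) = \frac{1}{KL} \sum_{i=0}^{K-1} X_i(e^{j\omega})\, Y_i^{*}(e^{j\omega}) \tag{5.4.14}$$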
Similar to (5.3.51), the DFT computation of $\hat{R}^{(PA)}_{xy}(e^{j\omega})$ is given by

$$\hat{R}^{(PA)}_{xy}(k) = \frac{1}{KL} \sum_{i=0}^{K-1} \left[\sum_{n=0}^{L-1} x_i(n)\, e^{-j2\pi kn/N}\right] \left[\sum_{n=0}^{L-1} y_i(n)\, e^{-j2\pi kn/N}\right]^{*} \tag{5.4.15}$$

where $0 \le k \le N-1$, $N > L$.
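
The segmentation in (5.4.12) and the DFT form (5.4.15) can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the routine used in the text: the function name `welch_cross_spectrum`, the Hamming default, and the test signals are all hypothetical choices. It follows (5.4.15) literally, so it applies the factor 1/(KL) and does not normalize by the window energy.

```python
import numpy as np

def welch_cross_spectrum(x, y, L, nfft=None, window=None):
    """Averaged cross-periodogram in the spirit of (5.4.12)-(5.4.15).

    Hypothetical helper: 50 percent overlap (D = L/2), Hamming window by
    default, and a zero-padded nfft-point FFT per segment.
    Returns the complex estimate and the number of segments K.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    if window is None:
        window = np.hamming(L)
    nfft = L if nfft is None else nfft
    D = L // 2                              # segment offset for 50% overlap
    K = (len(x) - L) // D + 1               # number of full segments
    acc = np.zeros(nfft, dtype=complex)
    for i in range(K):
        xi = x[i * D:i * D + L] * window    # x_i(n) = x(iD + n) w(n)
        yi = y[i * D:i * D + L] * window    # y_i(n) = y(iD + n) w(n)
        acc += np.fft.fft(xi, nfft) * np.conj(np.fft.fft(yi, nfft))
    return acc / (K * L), K

# Illustrative data: y is a delayed, noisy copy of x
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
y = np.roll(x, 3) + 0.1 * rng.standard_normal(1024)
Rxy, K = welch_cross_spectrum(x, y, L=128)
Rxx, _ = welch_cross_spectrum(x, x, L=128)
```

Calling the function with both arguments equal gives the corresponding auto power spectrum estimate, which is real and nonnegative; swapping the arguments conjugates the result.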


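Estimation of cospectra and quadrature spectra. Once the cross-spectrum $R_{xy}(e^{j\omega})$
has been estimated, we can compute the estimates of all the associated real-valued spectra
by replacing $R_{xy}(e^{j\omega})$ with its estimate $\hat{R}^{(PA)}_{xy}(e^{j\omega})$ in the definitions of these functions. To
estimate the cospectrum, we use

$$\hat{C}^{(PA)}_{xy}(\omega) = \operatorname{Re}[\hat{R}^{(PA)}_{xy}(e^{j\omega})] = \operatorname{Re}\left[\frac{1}{KL} \sum_{i=0}^{K-1} X_i(e^{j\omega})\, Y_i^{*}(e^{j\omega})\right] \tag{5.4.16}$$

and to estimate the quadrature spectrum, we use

$$\hat{Q}^{(PA)}_{xy}(\omega) = -\operatorname{Im}[\hat{R}^{(PA)}_{xy}(e^{j\omega})] = -\operatorname{Im}\left[\frac{1}{KL} \sum_{i=0}^{K-1} X_i(e^{j\omega})\, Y_i^{*}(e^{j\omega})\right] \tag{5.4.17}$$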


The analyses of bias, variance, and covariance of these estimates are similar in complexity 
to those of the autocorrelation spectral estimates, and the details can be found in Goodman 
(1957) and Jenkins and Watts (1968). 


Estimation of cross-amplitude and phase spectra. Following the definitions in (5.4.6)
and (5.4.7), we may estimate the cross-amplitude spectrum $A_{xy}(\omega)$ and the phase spectrum
$\Phi_{xy}(\omega)$ between the random processes x(n) and y(n) by

$$\hat{A}^{(PA)}_{xy}(\omega) = \sqrt{[\hat{C}^{(PA)}_{xy}(\omega)]^2 + [\hat{Q}^{(PA)}_{xy}(\omega)]^2} \tag{5.4.18}$$

and

$$\hat{\Phi}^{(PA)}_{xy}(\omega) = \tan^{-1}\{-\hat{Q}^{(PA)}_{xy}(\omega)/\hat{C}^{(PA)}_{xy}(\omega)\} \tag{5.4.19}$$

where the estimates $\hat{C}^{(PA)}_{xy}(\omega)$ and $\hat{Q}^{(PA)}_{xy}(\omega)$ are given by (5.4.16) and (5.4.17), respectively.
Since the cross-amplitude and phase spectral estimates are nonlinear functions of the
cospectral and quadrature spectral estimates, their analysis in terms of bias, variance, and
covariance is much more complicated. Once again, the details are available in Jenkins and
Watts (1968).


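Estimation of coherency spectrum. The coherency spectrum is given by the magnitude
of the complex coherence $\mathcal{G}_{xy}(\omega)$. Replacing $R_{xy}(e^{j\omega})$, $R_x(e^{j\omega})$, and $R_y(e^{j\omega})$ by their
estimates in (5.4.8), we see the estimate for the coherency spectrum is given by

$$|\hat{\mathcal{G}}^{(PA)}_{xy}(\omega)| = \frac{|\hat{R}^{(PA)}_{xy}(e^{j\omega})|}{\sqrt{\hat{R}^{(PA)}_{x}(e^{j\omega})\,\hat{R}^{(PA)}_{y}(e^{j\omega})}} = \left[\frac{[\hat{C}^{(PA)}_{xy}(\omega)]^2 + [\hat{Q}^{(PA)}_{xy}(\omega)]^2}{\hat{R}^{(PA)}_{x}(e^{j\omega})\,\hat{R}^{(PA)}_{y}(e^{j\omega})}\right]^{1/2} \tag{5.4.20}$$

with bias and variance properties similar to those of the cross-amplitude spectrum.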
In MATLAB the function

    Rxy = csd(x,y,Nfft,Fs,window(L),Noverlap);

is available, which is similar to the psd function described in Section 5.3.3. It estimates
the cross-spectral density of signal vectors x and y by using Welch’s method. The window
parameter specifies a window function, Fs is the sampling frequency for plotting purposes,
Nfft is the size of the FFT used, and Noverlap specifies the number of overlapping samples.
The function

    cohere(x,y,Nfft,Fs,window(L),Noverlap);

estimates the coherency spectrum between two vectors x and y. Its values are between 0
and 1.
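
For readers without these MATLAB functions, the coherency estimate (5.4.20) can be sketched directly in NumPy. The code below is an illustration under assumptions, not a reimplementation of `csd` or `cohere`: the helper names, the segment length L = 128, the Hamming window, and the two test cases (a linearly filtered signal and an independent one) are all hypothetical choices.

```python
import numpy as np

def welch_spectra(x, y, L=128):
    """Welch-averaged auto and cross spectra (50% overlap, Hamming window)."""
    w = np.hamming(L)
    D = L // 2
    K = (len(x) - L) // D + 1
    Rx = np.zeros(L)
    Ry = np.zeros(L)
    Rxy = np.zeros(L, dtype=complex)
    for i in range(K):
        Xi = np.fft.fft(x[i * D:i * D + L] * w)
        Yi = np.fft.fft(y[i * D:i * D + L] * w)
        Rx += np.abs(Xi) ** 2
        Ry += np.abs(Yi) ** 2
        Rxy += Xi * np.conj(Yi)
    return Rx / (K * L), Ry / (K * L), Rxy / (K * L)

def coherency(x, y, L=128):
    """Coherency spectrum estimate, following (5.4.20)."""
    Rx, Ry, Rxy = welch_spectra(x, y, L)
    return np.abs(Rxy) / np.sqrt(Rx * Ry)

rng = np.random.default_rng(1)
x = rng.standard_normal(4096)
y_lin = x + 0.5 * np.concatenate(([0.0], x[:-1]))   # y(n) = x(n) + 0.5 x(n-1)
y_ind = rng.standard_normal(4096)                   # unrelated to x

G_lin = coherency(x, y_lin)   # expected to stay close to 1
G_ind = coherency(x, y_ind)   # expected to be small, but not exactly 0
```

With averaging in place, the estimate separates the two cases: it stays near one for the linearly related pair and drops toward zero for the independent pair, with a residual level that shrinks as the number of averaged segments grows.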


5.4.2 Estimation of Frequency Response Functions 


When random processes x(n) and y(n) are the input and output of some physical system, the 
bivariate spectral estimation techniques discussed in this section can be used to estimate the 
system characteristics, namely, its frequency response. Problems of this kind arise in many 
applications including communications, industrial control, and biomedical signal process- 
ing. In communications applications, we need to characterize a channel over which signals 
are transmitted. In this situation, a known training signal is transmitted, and the channel 
response is recorded. By using the statistics of these two signals, it is possible to estimate 
channel characteristics within a reasonable accuracy. In the industrial applications such as 
a gas furnace, the classical methods using step (or sinusoidal) inputs may be inappropriate 
because of large disturbances generated within the system. Hence, it is necessary to use 
statistical methods that take into account noise generated in the system. 

From Chapter 3, we know that if x(n) and y(n) are input and output signals of an LTI
system characterized by the impulse response h(n), then

$$y(n) = h(n) * x(n) \tag{5.4.21}$$

The impulse response h(n), in principle, can be computed through the deconvolution
operation. However, deconvolution is not always computationally feasible. If the input and output
processes are jointly stationary, then from Chapter 3 we know that the cross-correlation
between these two processes is given by

$$r_{yx}(l) = h(l) * r_x(l) \tag{5.4.22}$$

and the cross-spectrum is given by

$$R_{yx}(e^{j\omega}) = H(e^{j\omega})\, R_x(e^{j\omega}) \tag{5.4.23}$$

or

$$H(e^{j\omega}) = \frac{R_{yx}(e^{j\omega})}{R_x(e^{j\omega})} \tag{5.4.24}$$

Hence, if we can estimate the auto power spectrum and cross power spectrum with reasonable
accuracy, then we can determine the frequency response of the system.


Consider next an LTI system with additive output noise,† as shown in Figure 5.26. This
model situation applies to many practical problems where the input measurements x(n) are
essentially without noise while the output measurements y(n) can be modeled by the sum
of the ideal response $y_o(n)$ due to x(n) and an additive noise v(n), which is statistically
independent of x(n). If we observe the input x(n) and the ideal output $y_o(n)$, the frequency
response can be obtained by

$$H(e^{j\omega}) = \frac{R_{y_o x}(e^{j\omega})}{R_x(e^{j\omega})} \tag{5.4.25}$$

where all signals are assumed stationary with zero mean (see Section 5.3.1). Since x(n) and
v(n) are independent, we can easily show that

$$R_{y_o x}(e^{j\omega}) = R_{yx}(e^{j\omega}) \tag{5.4.26}$$

† More general situations involving both additive input noise and additive output noise are discussed in Bendat
and Piersol (1980).




FIGURE 5.26 
Input-output LTI system model with output noise. 




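and

$$R_y(e^{j\omega}) = R_{y_o}(e^{j\omega}) + R_v(e^{j\omega}) \tag{5.4.27}$$

where

$$R_{y_o}(e^{j\omega}) = |H(e^{j\omega})|^2 R_x(e^{j\omega}) \tag{5.4.28}$$

is the ideal output PSD produced by the input. From (5.4.25) and (5.4.26), we have

$$H(e^{j\omega}) = \frac{R_{yx}(e^{j\omega})}{R_x(e^{j\omega})} \tag{5.4.29}$$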
which shows that we can determine the frequency response by using the cross power spectral
density between the noisy output and the input signals. Given a finite record of input-output
data $\{x(n), y(n)\}_0^{N-1}$, we estimate $\hat{R}_{yx}(e^{j\omega})$ and $\hat{R}_x(e^{j\omega})$ by using one of the
previously discussed methods and then estimate $H(e^{j\omega})$ at a set of equidistant frequencies
$\{\omega_k = 2\pi k/K\}_0^{K-1}$, that is,

$$\hat{H}(e^{j\omega_k}) = \frac{\hat{R}_{yx}(e^{j\omega_k})}{\hat{R}_x(e^{j\omega_k})} \tag{5.4.30}$$
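
The estimator (5.4.30) can be tried on synthetic data for which the true response is known. The sketch below is an assumption-laden illustration: the three-tap FIR system, the noise level, the segment length, and the function name `estimate_frequency_response` are hypothetical and chosen only to keep the example small. Welch averaging with a Hamming window and 50 percent overlap stands in for "one of the previously discussed methods".

```python
import numpy as np

def estimate_frequency_response(x, y, L=64):
    """Estimate H at omega_k = 2*pi*k/L as R_yx / R_x, in the spirit of (5.4.30)."""
    w = np.hamming(L)
    D = L // 2
    K = (len(x) - L) // D + 1
    Rx = np.zeros(L)
    Ryx = np.zeros(L, dtype=complex)
    for i in range(K):
        Xi = np.fft.fft(x[i * D:i * D + L] * w)
        Yi = np.fft.fft(y[i * D:i * D + L] * w)
        Rx += np.abs(Xi) ** 2
        Ryx += Yi * np.conj(Xi)      # cross-periodogram of output with input
    return Ryx / Rx                  # the common 1/(KL) factor cancels

rng = np.random.default_rng(2)
h = np.array([1.0, 0.5, 0.25])                       # hypothetical FIR system
x = rng.standard_normal(16384)                       # white input
v = 0.1 * rng.standard_normal(len(x))                # additive output noise
y = np.convolve(x, h)[:len(x)] + v                   # noisy output

H_hat = estimate_frequency_response(x, y)
H_true = np.fft.fft(h, 64)
max_err = np.max(np.abs(H_hat - H_true))
```

Because the noise is independent of the input, it averages out of the cross-spectrum, so `H_hat` tracks `H_true` in both magnitude and phase; the output PSD alone would recover only the magnitude.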
The coherence function, which measures the linear correlation between two signals
x(n) and y(n) in the frequency domain, is given by

$$\mathcal{G}^2_{xy}(\omega) = \frac{|R_{xy}(e^{j\omega})|^2}{R_x(e^{j\omega})\, R_y(e^{j\omega})} \tag{5.4.31}$$

and satisfies the inequality $0 \le \mathcal{G}^2_{xy}(\omega) \le 1$ (see Section 3.3.6). If $R_{xy}(e^{j\omega}) = 0$ for
all $\omega$, then $\mathcal{G}^2_{xy}(\omega) = 0$. On the other hand, if $y(n) = h(n) * x(n)$, then $\mathcal{G}^2_{xy}(\omega) = 1$
because $R_y(e^{j\omega}) = |H(e^{j\omega})|^2 R_x(e^{j\omega})$ and $R_{xy}(e^{j\omega}) = H^{*}(e^{j\omega}) R_x(e^{j\omega})$. Furthermore,
we can show that the coherence function is invariant under linear transformations. Indeed, if
$x_1(n) = h_1(n) * x(n)$ and $y_1(n) = h_2(n) * y(n)$, then $\mathcal{G}^2_{x_1 y_1}(\omega) = \mathcal{G}^2_{xy}(\omega)$ (see Problem 5.16).
To avoid delta function behavior at $\omega = 0$, we should remove the mean value from the data
before we compute $\mathcal{G}^2_{xy}(\omega)$. Also, we require $R_x(e^{j\omega}) R_y(e^{j\omega}) > 0$ to avoid division by 0.

In practice, the coherence function is usually greater than 0 and less than 1. This may 
result from one or more of the following reasons (Bendat and Piersol 1980): 


1. Excessive measurement noise. 

2. Significant resolution bias in the spectral estimates. 

3. The system relating y(n) to x(n) is nonlinear. 

4. The output y(n) is not produced exclusively by the input x(n).


Using (5.4.28), (5.4.25), $R_{xy_o}(e^{j\omega}) = H^{*}(e^{j\omega}) R_x(e^{j\omega})$, and (5.4.31), we obtain

$$R_{y_o}(e^{j\omega}) = \mathcal{G}^2_{xy}(\omega)\, R_y(e^{j\omega}) \tag{5.4.32}$$

which is known as the coherent output PSD. Combining the last equation with (5.4.27), we
have

$$R_v(e^{j\omega}) = [1 - \mathcal{G}^2_{xy}(\omega)]\, R_y(e^{j\omega}) \tag{5.4.33}$$

which can be interpreted as the part of the output PSD that cannot be produced from the
input by using linear operations.
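
The intermediate steps behind the coherent output PSD can be written out explicitly. The following is a short derivation sketch under the assumptions of the output-noise model (x(n) and v(n) independent, so that the cross-spectrum of x(n) with the noisy output equals that with the ideal output); the argument $e^{j\omega}$ is suppressed for brevity.

```latex
\begin{align*}
|R_{xy}|^2 &= |H^{*} R_x|^2 = |H|^2 R_x^2 = R_{y_o} R_x
  && \text{using } R_{y_o} = |H|^2 R_x \\
\mathcal{G}^2_{xy}(\omega) &= \frac{|R_{xy}|^2}{R_x R_y}
  = \frac{R_{y_o} R_x}{R_x R_y} = \frac{R_{y_o}}{R_y}
  && \Rightarrow\; R_{y_o} = \mathcal{G}^2_{xy}(\omega)\, R_y \\
R_v &= R_y - R_{y_o} = [1 - \mathcal{G}^2_{xy}(\omega)]\, R_y
  && \text{using } R_y = R_{y_o} + R_v
\end{align*}
```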


Substitution of (5.4.27) into (5.4.32) results in

$$\mathcal{G}^2_{xy}(\omega) = 1 - \frac{R_v(e^{j\omega})}{R_y(e^{j\omega})} \tag{5.4.34}$$

which shows that $\mathcal{G}^2_{xy}(\omega) \to 1$ as $R_v(e^{j\omega})/R_y(e^{j\omega}) \to 0$ and $\mathcal{G}^2_{xy}(\omega) \to 0$ as
$R_v(e^{j\omega})/R_y(e^{j\omega}) \to 1$. Typically, the coherence function between input and output measurements
reveals the presence of errors and helps to identify their origin and magnitude. Therefore, the
coherence function provides a useful tool for evaluating the accuracy of frequency response
estimates.
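
Relation (5.4.34) can be evaluated in closed form for a simple model, which shows how the coherence drops wherever the output SNR is low. The sketch below uses theoretical spectra only (no estimation); the first-order system with a pole at 0.9 and the unit noise variances are hypothetical values chosen for illustration.

```python
import numpy as np

omega = np.linspace(0.0, np.pi, 256)
a = 0.9                              # hypothetical pole location
sigma_x2, sigma_v2 = 1.0, 1.0        # assumed white input and noise variances

H = 1.0 / (1.0 - a * np.exp(-1j * omega))   # H(z) = 1 / (1 - a z^-1)
Rx = sigma_x2 * np.ones_like(omega)         # white input PSD
Rv = sigma_v2 * np.ones_like(omega)         # white output-noise PSD
Ryo = np.abs(H) ** 2 * Rx                   # ideal output PSD, (5.4.28)
Ry = Ryo + Rv                               # noisy output PSD, (5.4.27)
Rxy = np.conj(H) * Rx                       # cross-spectrum H* R_x

msc = np.abs(Rxy) ** 2 / (Rx * Ry)          # coherence, (5.4.31)
coherent_output = msc * Ry                  # coherent output PSD, (5.4.32)
```

In this model the coherence is close to one near $\omega = 0$, where the system gain is large, and falls off toward $\omega = \pi$, where the noise dominates the output.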

In MATLAB the function 


    H = tfe(x,y,Nfft,Fs,window(L),Noverlap)


is available that estimates the transfer function of the system with input signal x and output 
y using Welch’s method. The window parameter specifies a window function, Fs is the 
sampling frequency for plotting purposes, Nfft is the size of the FFT used, and Noverlap 
specifies the number of overlapping samples. 

We next provide two examples that illustrate some of the problems that may arise when 
we estimate frequency response functions by using input and output measurements. 


EXAMPLE 5.4.1. Consider the AP(4) system

$$H(z) = \frac{1}{1 - 2.7607z^{-1} + 3.8106z^{-2} - 2.6535z^{-3} + 0.9238z^{-4}}$$

discussed in Example 5.3.2. The input is white Gaussian noise, and the output of this system is
corrupted by additive white Gaussian noise, as shown in Figure 5.27. We wish to estimate the
frequency response of the system from a set of measurements $\{x(n), y(n)\}_0^{N-1}$. Since the input
is white, when the output signal-to-noise ratio (SNR) is very high, we can estimate the magnitude
response of the system by computing the PSD of the output signal. However, to compute the
phase response or a more accurate estimate of the magnitude response, we should use the joint
measurements of the input and output signals, as explained above.

Figure 5.27 shows estimates of the MSC function, magnitude response functions (in linear
and log scales), and phase response functions for two different levels of output SNR: 32 and
0 dB. When SNR = 32 dB, we note that $|\mathcal{G}_{xy}(\omega)|$ is near unity at almost all frequencies, as
we theoretically expect for ideal LTI input-output relations. The estimated magnitude and phase
responses are almost identical to the theoretical ones except at the two sharp peaks
of $|H(e^{j\omega})|$. Since the SNR is high, the two notches in $|\mathcal{G}_{xy}(\omega)|$ at the same frequencies suggest
a bias error due to the lack of sufficient frequency resolution. When SNR = 0 dB, we see that
$|\mathcal{G}_{xy}(\omega)|$ falls very sharply for frequencies above 0.2 cycle per sampling interval. We notice
that the presence of noise increases the random errors in the estimates of magnitude and phase
response in this frequency region, and the bias error in the peaks of the magnitude response.
Finally, we note that the uncertainty fluctuations in $|\hat{\mathcal{G}}_{xy}(\omega)|$ increase as $|\mathcal{G}_{xy}(\omega)| \to 0$, as
predicted by the formula


stdIASY @ll 1-16 @? (5.4.35) 
IGxy()| Sy VK 


where std(-) means standard deviation and K is the number of averaged segments (Bendat and 
Piersol 1980). 


EXAMPLE 5.4.2. In this example we illustrate the use of frequency response estimation to study 
the effect of respiration and blood pressure on heart rate. Figure 5.28 shows the systolic blood 
pressure (mmHg), heart rate (beats per minute), and the respiration (mL) signals with their 
corresponding PSD functions (Grossman 1998). The sampling frequency is $F_s = 5$ Hz, and
the PSDs were estimated using the method of averaged periodograms with 50 percent overlap. 
Note the corresponding quasiperiodic oscillations of blood pressure and heart rate occurring 
approximately every 12 s (0.08 Hz). Close inspection of the heart rate time series will also reveal 
Estimated coherence, magnitude response, and phase response for the
AP(4) system. The solid lines show the ideal magnitude and phase 
responses. 


another rhythm corresponding to the respiratory period (about 4.3 s, or 0.23 Hz). These rhythms 
reflect nervous system mechanisms that control the activity of the heart and the circulation under 
most circumstances. 

The left column of Figure 5.29 shows the coherence, magnitude response, and phase re- 
sponse between respiration as input and heart rate as output. Heart rate fluctuates clearly at the 
respiratory frequency (here at 0.23 Hz); this is indicated by the large amount of heart rate power 
and the high degree of coherence at the respiratory frequency. Heart function is largely controlled 
FIGURE 5.28


Continuous systolic blood pressure (SBP), heart rate (HR), and respiration of a young 
man during quiet upright tilt and their estimated PSDs. 


by two branches of the autonomic nervous system, the parasympathetic and sympathetic. Fre- 
quency analysis of cardiovascular signals may improve our understanding of the manner in which 
these two branches interact under varied circumstances. Heart rate fluctuations at the respiratory 
frequency (termed respiratory sinus arrhythmia) are primarily mediated by the parasympathetic 
branch of the autonomic nervous system. Increases in respiratory sinus arrhythmia indicate en- 
hanced parasympathetic influence upon the heart. Sympathetic oscillations of heart rate occur 
only at slower frequencies (below 0.10 Hz) owing to the more sluggish frequency response 
characteristics of the sympathetic branch of the autonomic nervous system. 

The right column of Figure 5.29 shows the coherence, magnitude response, and phase 
response between systolic blood pressure as input and heart rate as output. Coherent oscillations 
among cardiac and blood pressure signals can often be discerned in a frequency band with a 
typical center frequency of 0.10 Hz (usual range, 0.07 to 0.12 Hz). This phenomenon has been tied 
to the cardiovascular baroreflex system, which involves baroreceptors, that is, bodies of cells in 
the carotid arteries and aorta that are sensitive to stretch. When blood pressure is increased, these 
baroreceptors fire proportionally to stretch and pressure changes, sending commands via the brain 
to the heart and circulatory system. This baroreflex system is the only known physiological system 
acting to buffer rapid and extreme surges or falls in blood pressure. Increased baroreceptor stretch, 
for example, slows the heart rate by means of increased parasympathetic activity; decreased 
baroreceptor stretch will elicit cardiovascular sympathetic activation that will speed the heart 
and constrict arterial vessels. Thus pressure drops due to a decrease in flow. The 0.10-Hz blood 
pressure oscillations (see PSD in Figure 5.28) are sympathetic in origin and are produced by 
periodic sympathetic constriction of arterial blood vessels. 
FIGURE 5.34
Spectrum estimation of the wire gage wave data using the multitaper 
method in Example 5.5.2. 


and lower limits of the 95 percent confidence interval for a fixed frequency. For comparison 
purposes, the “raw” periodogram estimate is also shown as small dots. Clearly, the periodogram 
has a large variability that is reduced in the multitaper estimate. At the same time, the multitaper 
estimate is not smooth, but its variability is small enough to follow the shape of the overall 
structure. 


5.5.2 Estimation of Cross Power Spectrum 


The multitapering approach can also be extended to the estimation of the cross power
spectrum. Following (5.4.11), the multitaper estimator of the cross power spectrum is given by

$$\hat{R}^{(MT)}_{xy}(e^{j\omega}) = \frac{1}{K} \sum_{k=0}^{K-1} \left[\sum_{n=0}^{L_w-1} w_k(n)\,x(n)\, e^{-j\omega n}\right] \left[\sum_{n=0}^{L_w-1} w_k(n)\,y(n)\, e^{-j\omega n}\right]^{*} \tag{5.5.11}$$

where $w_k(n)$ is the kth-order data taper of length $L_w$ and a fixed-resolution bandwidth of
2W. As with the auto power spectrum, the use of multitaper averaging reduces the variability
of the cross-periodogram $\hat{R}_{xy}(e^{j\omega})$. Once again, the number of equivalent degrees of
freedom for $\hat{R}^{(MT)}_{xy}(e^{j\omega})$ is equal to 2K.

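The real-valued functions associated with the cross power spectrum can also be
estimated by using the multitaper approach in a similar fashion. The cospectrum and the
quadrature spectrum are given by

$$\hat{C}^{(MT)}_{xy}(\omega) = \operatorname{Re}[\hat{R}^{(MT)}_{xy}(e^{j\omega})] \quad \text{and} \quad \hat{Q}^{(MT)}_{xy}(\omega) = -\operatorname{Im}[\hat{R}^{(MT)}_{xy}(e^{j\omega})] \tag{5.5.12}$$

while the cross-amplitude spectrum and the phase spectrum are given by

$$\hat{A}^{(MT)}_{xy}(\omega) = \sqrt{[\hat{C}^{(MT)}_{xy}(\omega)]^2 + [\hat{Q}^{(MT)}_{xy}(\omega)]^2} \quad \text{and} \quad \hat{\Phi}^{(MT)}_{xy}(\omega) = \tan^{-1}\left\{-\frac{\hat{Q}^{(MT)}_{xy}(\omega)}{\hat{C}^{(MT)}_{xy}(\omega)}\right\} \tag{5.5.13}$$

Finally, the coherency spectrum is given by

$$|\hat{\mathcal{G}}^{(MT)}_{xy}(\omega)| = \left[\frac{[\hat{C}^{(MT)}_{xy}(\omega)]^2 + [\hat{Q}^{(MT)}_{xy}(\omega)]^2}{\hat{R}^{(MT)}_{x}(e^{j\omega})\,\hat{R}^{(MT)}_{y}(e^{j\omega})}\right]^{1/2} \tag{5.5.14}$$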


MATLAB does not provide a function for cross power spectrum estimation using the
multitaper approach. However, by using the dpss function, it is relatively straightforward
to implement the simple averaging method of (5.5.11).
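
The simple averaging over tapers can be sketched as follows. To keep the example NumPy-only, this illustration swaps in sinusoidal tapers in place of the Slepian (DPSS) tapers named above; the function names, the number of tapers K = 8, and the test signals are hypothetical choices, and the tapers span the full record length.

```python
import numpy as np

def sine_tapers(N, K):
    """K orthonormal sinusoidal tapers of length N (rows of the result)."""
    n = np.arange(1, N + 1)
    k = np.arange(1, K + 1)[:, None]
    return np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * k * n / (N + 1))

def multitaper_cross_spectrum(x, y, K=8, nfft=None):
    """Average of K tapered cross-periodograms, plus both auto spectra."""
    W = sine_tapers(len(x), K)               # shape (K, N)
    X = np.fft.fft(W * x, nfft, axis=1)      # one tapered DFT per row
    Y = np.fft.fft(W * y, nfft, axis=1)
    Rxy = np.mean(X * np.conj(Y), axis=0)
    Rx = np.mean(np.abs(X) ** 2, axis=0)
    Ry = np.mean(np.abs(Y) ** 2, axis=0)
    return Rxy, Rx, Ry

rng = np.random.default_rng(3)
x = rng.standard_normal(512)
y = np.roll(x, 2) + 0.5 * rng.standard_normal(512)   # delayed, noisy copy
Rxy, Rx, Ry = multitaper_cross_spectrum(x, y)
G = np.abs(Rxy) / np.sqrt(Rx * Ry)                   # coherency estimate
```

Because the tapers have unit energy, no extra length normalization is applied. The cospectrum, quadrature spectrum, amplitude, and phase estimates follow from `Rxy` exactly as in the Welch case.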


EXAMPLE 5.5.3. Again consider the wire gage and the infrared gage wave data of Figure 5.24. The 
multitaper estimate of the cross power spectrum $\hat{R}^{(MT)}_{xy}(e^{j\omega})$ of these two 4096-point sequences
is obtained by using (5.5.11) in which the parameter W is set to 4. Figure 5.35 shows plots 
of the estimates of the auto power spectra of the two data sets in solid lines. The cross power 
spectrum of the two signals is shown with a dotted line. It is interesting to note that the two auto 
power spectra agree almost perfectly over the band up to 0.3 Hz and then reasonably well up to 
0.9 Hz, beyond which point the spectrum due to the infrared gage is consistently higher due to 
high-frequency noise inherent in the measurements. The cross power spectrum agrees with the 
two auto power spectra at low frequencies up to 0.2 Hz. Figure 5.36 contains two graphs; the 
upper graph is for the MSC while the lower one is for the phase spectrum. Consistent with our
observation of the cross power spectrum in Figure 5.35, the MSC is almost one over these lower
frequencies. The phase spectrum is almost a linear function over the range over which the two
auto power spectra agree. Thus, the multitaper approach provides estimates that agree with the
conventional techniques.

FIGURE 5.35
Cross power spectrum estimation of the wave data using the multitaper approach.

FIGURE 5.36
Coherency and phase spectrum of the wave data using the multitaper approach.


5.6 SUMMARY 


In this chapter, we presented many different nonparametric methods for estimating the 
power spectrum of a wide-sense stationary random process. Nonparametric methods do not 
depend on any particular model of the process but use estimators that are determined entirely 
by the data. Therefore, one has to be very careful about the data and the interpretation of 
results based on them. 

We began by revisiting the topic of frequency analysis of deterministic signals. Since 
the spectrum estimation of random processes is based on the Fourier transformation of 
data, the purpose of this discussion was to identify and study errors associated with the 
practical implementation. In this regard, three problems—the sampling of the continuous 
signal, windowing of the sampled data, and the sampling of the spectrum—were isolated 
and discussed in detail. Some useful data windows and their characteristics were also given. 
This background was necessary to understand more complex spectrum estimation methods 
and their results. 

An important topic of autocorrelation estimation was considered next. Although this 
discussion was not directly related to spectrum estimation, its inclusion was appropriate 
since one important method (i.e., that of Blackman and Tukey) was based on this estimation. 
The statistical properties of the estimator and its implementation completed this topic. 

The major part of this chapter was devoted to the section on the auto power spectrum 
estimation. The classical approach was to develop an estimator from the Fourier transform of 
the given values of the process. This was called the periodogram method, and it resulted in a 
natural PSD estimator as a Fourier transform of an autocorrelation estimate. Unfortunately, 
the statistical analysis of the periodogram showed that it was not an unbiased estimator or 
a consistent estimator; that is, its variability did not decrease with increasing data record 
length. The modification of the periodogram using the data window lessened the spectral 
leakage and improved the unbiasedness but did not decrease the variance. Several examples 
were given to verify these aspects. 

To improve the statistical performance of the simple periodogram, we then looked at 
several possible improvements to the basic technique. Two main directions emerged for re- 
ducing the variance: periodogram smoothing and periodogram averaging. These approaches 
produced consistent and asymptotically unbiased estimates. The periodogram smoothing 
was obtained by applying the lag window to the autocorrelation estimate and then Fourier- 
transforming it. This method was due to Blackman and Tukey, and results of its mean and 
variance were given. The periodogram averaging was done by segmenting the data to obtain 
several records, followed by windowing to reduce spectral leakage, and finally by averaging 
their periodograms to reduce variance. This was the well-known Welch-Bartlett method, 
and the results of its statistical analysis were also given. Finally, implementations based on 
the DFT and MATLAB were given for both methods along with several examples to illustrate 
the performance of their estimates. These nonparametric methods were further extended to 
estimate the cross power spectrum, coherence functions, and transfer function. 

Finally, we presented a newer nonparametric technique for auto power spectrum and 
cross power spectrum that was based on applying several data windows or tapers to the data 
followed by averaging of the resulting modified periodograms. The basic principle behind 
this method was that if the tapers are orthonormal and properly designed (to reduce leakage), 
then the resulting periodograms can be considered to be independent at each frequency and
hence their average would reduce the variance. Two orthogonal sets of data tapers, namely,
the Slepian and sinusoidal, were provided. The implementation using MATLAB was given, 
and examples were given to complete the chapter. 


PROBLEMS 


5.1 Let $x_c(t)$, $-\infty < t < \infty$, be a continuous-time signal with Fourier transform $X_c(F)$, $-\infty <
F < \infty$, and let x(n) be obtained by sampling $x_c(t)$ every T per sampling interval with its
DTFT $X(e^{j\omega})$.

(a) Show that the DTFT $X(e^{j\omega})$ is given by

$$X(e^{j\omega}) = F_s \sum_{l=-\infty}^{\infty} X_c(fF_s - lF_s) \qquad \omega = 2\pi f \qquad F_s = \frac{1}{T}$$

(b) Let $X_p(k)$ be obtained by sampling $X(e^{j\omega})$ every $2\pi/N$ rad per sampling interval, that is,

$$X_p(k) = X(e^{j2\pi k/N}) = F_s \sum_{l=-\infty}^{\infty} X_c\left(\frac{kF_s}{N} - lF_s\right)$$

Then show that the inverse DFT of $X_p(k)$ is given by

$$x_p(n) \triangleq \mathrm{IDFT}(X_p) \;\Rightarrow\; x_p(n) = \sum_{m=-\infty}^{\infty} x_c(nT - mNT)$$

5.2 MATLAB provides two functions to generate triangular windows, namely, bartlett and triang.
These two functions actually generate two slightly different coefficients.

(a) Use bartlett to generate N = 11, 31, and 51 length windows $w_B(n)$, and plot their
samples, using the stem function.

(b) Compute the DTFTs $W_B(e^{j\omega})$, and plot their magnitudes over $[-\pi, \pi]$. Determine
experimentally the width of the mainlobe as a function of N. Repeat part (a) using the triang
function. How are the lengths and the mainlobe widths different in this case? Which window
function is an appropriate one in terms of nonzero samples?

(c) Determine the length of the bartlett window that has the same mainlobe width as that of
a 51-point rectangular window.

5.3 Sidelobes of the window transform contribute to the spectral leakage due to the
frequency-domain convolution. One measure of this leakage is the maximum sidelobe height, which
generally occurs at the first sidelobe for all windows except the Dolph-Chebyshev window.

(a) For simple windows such as the rectangular, Hanning, or Hamming window, the maximum
sidelobe height is independent of window length N. Choose N = 11, 31, and 51, and
determine the maximum sidelobe height in decibels for the above windows.

(b) For the Kaiser window, the maximum sidelobe height is controlled by the shape parameter
$\beta$ and is proportional to $\beta/\sinh\beta$. Using several values of $\beta$ and N, verify the relationship
between $\beta$ and the maximum sidelobe height.

(c) Determine the value of $\beta$ that gives the maximum sidelobe height nearly the same as that of
the Hamming window of the same length. Compare the mainlobe widths and the window
coefficients of these two windows.

(d) For the Dolph-Chebyshev window, all sidelobes have the same height A in decibels. For
A = 40, 50, and 60 dB, determine the 3-dB mainlobe widths for N = 31 length window.

5.4 Let x(n) be given by

$$y(n) = \cos\omega_1 n + \cos(\omega_2 n + \phi) \quad \text{and} \quad x(n) = y(n)\,w(n)$$

where w(n) is a length-N data window. The $|X(e^{j\omega})|$ is computed using MATLAB and is plotted
over $[0, \pi]$.
(a) Let w(n) be a rectangular window. For $\omega_1 = 0.25\pi$ and $\omega_2 = 0.3\pi$, determine the
minimum length N so that the two frequencies in the $|X(e^{j\omega})|^2$ plot are barely separable for
any arbitrary $\phi \in [-\pi, \pi]$. (You may want to consider the worst possible value of $\phi$ or
experiment, using several values of $\phi$.)

(b) Repeat part (a) for a Hamming window.

(c) Repeat part (a) for a Blackman window.

5.5 In this problem we will prove that the autocorrelation matrix $\hat{\mathbf{R}}_x$ given in (5.2.3), in which the
sample correlations are defined by (5.2.1), is a nonnegative definite matrix, that is,

$$\mathbf{x}^H \hat{\mathbf{R}}_x \mathbf{x} \ge 0 \quad \text{for every } \mathbf{x} \ne \mathbf{0}$$

(a) Show that $\hat{\mathbf{R}}_x$ can be decomposed into the product $\mathbf{X}^H \mathbf{X}$, where $\mathbf{X}$ is called a data matrix.
Determine the form of $\mathbf{X}$.
(b) Using the above decomposition, now prove that $\mathbf{x}^H \hat{\mathbf{R}}_x \mathbf{x} \ge 0$, for every $\mathbf{x} \ne \mathbf{0}$.

5.6 An alternative autocorrelation estimate $\check{r}_x(l)$ is given in (5.2.13) and is repeated below.

$$\check{r}_x(l) = \begin{cases} \dfrac{1}{N-l} \displaystyle\sum_{n=0}^{N-l-1} x(n+l)\,x^{*}(n) & 0 \le l \le L < N \\[2ex] \check{r}_x^{*}(-l) & -N < -L \le l < 0 \\[1ex] 0 & \text{elsewhere} \end{cases}$$

(a) Show that the mean of $\check{r}_x(l)$ is equal to $r_x(l)$, and determine an approximate expression for
the variance of $\check{r}_x(l)$.

(b) Show that the mean of the corresponding periodogram [that is, $\check{R}_x(e^{j\omega}) = \mathcal{F}\{\check{r}_x(l)\}$] is
given by

$$E\{\check{R}_x(e^{j\omega})\} = \frac{1}{2\pi} \int_{-\pi}^{\pi} R_x(e^{j\varphi})\, W_R(e^{j(\omega-\varphi)})\, d\varphi$$

where $W_R(e^{j\omega})$ is the DTFT of the rectangular window and is sometimes called the Dirichlet
kernel.

5.7 Consider the above unbiased autocorrelation estimator $\check{r}_x(l)$ of a zero-mean white Gaussian
process with variance $\sigma_x^2$.

(a) Determine the variance of $\check{r}_x(l)$. Compute its limiting value as $l \to \infty$.
(b) Repeat part (a) for the biased estimator $\hat{r}_x(l)$. Comment on any differences in the results.

5.8 Show that the autocorrelation matrix $\check{\mathbf{R}}_x$ formed by using $\check{r}_x(l)$ is not nonnegative definite, that
is,

$$\mathbf{x}^H \check{\mathbf{R}}_x \mathbf{x} < 0 \quad \text{for some } \mathbf{x} \ne \mathbf{0}$$

5.9 In this problem, we will show that the periodogram $\hat{R}_x(e^{j\omega})$ can also be expressed as a DTFT
of the autocorrelation estimate $\hat{r}_x(l)$ given in (5.2.1).

(a) Let $v(n) = x(n)\,w_R(n)$, where $w_R(n)$ is a rectangular window of length N. Show that

$$\hat{r}_x(l) = \frac{1}{N}\, v(l) * v^{*}(-l) \tag{P.1}$$

(b) Take the DTFT of (P.1) to show that

$$\hat{R}_x(e^{j\omega}) = \sum_{l=-N+1}^{N-1} \hat{r}_x(l)\, e^{-j\omega l}$$

5.10 Consider the following simple windows over $0 \le n \le N-1$: rectangular, Bartlett, Hanning,
and Hamming.

(a) Determine analytically the DTFT of each of the above windows.
(b) Sketch the magnitude of these Fourier transforms for N = 31.
(c) Verify your sketches by performing a numerical computation of the DTFT using MATLAB.


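5.11 The Parzen window is given by

$$w_P(l) \triangleq \begin{cases} 1 - 6\left(\dfrac{l}{L}\right)^2 + 6\left(\dfrac{|l|}{L}\right)^3 & 0 \le |l| \le \dfrac{L}{2} \\ 2\left(1 - \dfrac{|l|}{L}\right)^3 & \dfrac{L}{2} < |l| < L \\ 0 & \text{elsewhere} \end{cases} \tag{P.2}$$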
(a) Show that its DTFT is given by 
na ee sin (wL/4) 14 ee ee 
PNT er Sera GogAy,|)\ 


Hence using the Parzen window as a correlation window always produces nonnegative 
spectrum estimates. 

(b) Using MATLAB, compute and plot the time-domain window $w_P(l)$ and its frequency-domain
response $W_P(e^{j\omega})$ for $L = 5$, 10, and 20.

(c) From the frequency-domain plots in part (b) experimentally determine the 3-dB mainlobe
width $\Delta\omega$ as a function of $L$.


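5.12 The variance reduction ratio of a correlation window $w_a(l)$ is defined as

$$\frac{\operatorname{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\}}{\operatorname{var}\{\hat{R}_x(e^{j\omega})\}} \simeq \frac{E_w}{N} \qquad 0 < \omega < \pi$$

where

$$E_w = \frac{1}{2\pi}\int_{-\pi}^{\pi} W_a^2(e^{j\omega})\,d\omega = \sum_{l=-(L-1)}^{L-1} w_a^2(l)$$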


(a) Using MATLAB, compute and plot $E_w$ as a function of $L$ for the following windows: rectangular,
Bartlett, Hanning, Hamming, and Parzen.

(b) Using your computations above, show that for $L \gg 1$, the variance reduction ratio for each
window is given by the formula in the following table.

Window name     Variance reduction factor
Rectangular     2L/N
Bartlett        0.667L/N
Hanning         0.75L/N
Hamming         0.7948L/N
Parzen          0.539L/N


5.13 For $L > 100$, the direct computation of $\hat{r}_x(l)$ using (5.3.49) is time-consuming; hence an
indirect computation using the DFT can be more efficient. This computation is implemented by
the following steps:

• Given the sequence $\{x(n)\}_{n=0}^{N-1}$, pad enough zeros to make it a $(2N-1)$-point sequence.
• Compute the $N_{\text{FFT}}$-point FFT of $x(n)$ to obtain $\tilde{X}(k)$, where $N_{\text{FFT}}$ is equal to the next power-of-2 number that is greater than or equal to $2N-1$.
• Compute $\frac{1}{N}|\tilde{X}(k)|^2$ to obtain $\hat{R}_x(k)$.
• Compute the $N_{\text{FFT}}$-point IFFT of $\hat{R}_x(k)$ to obtain $\hat{r}_x(l)$.

Develop a MATLAB function rx = autocfft(x,L) which computes $\hat{r}_x(l)$ over $-L \le l \le L$.
Compare this function with the autoc function discussed in the chapter in terms of the
execution time for $L > 100$.
5.14 The Welch-Bartlett estimate $\hat{R}_x^{(PA)}(k)$ is given by

$$\hat{R}_x^{(PA)}(k) = \frac{1}{KL} \sum_{i=0}^{K-1} |\tilde{X}_i(k)|^2$$

If $x(n)$ is real-valued, then the sum in the above expression can be evaluated more efficiently.
Let $K$ be an even number. Then we will combine two real-valued sequences into one complex-valued
sequence and compute one FFT, which will reduce the overall computations. Specifically,
let

$$g_r(n) \triangleq x_{2r}(n) + j\,x_{2r+1}(n) \qquad n = 0, 1, \ldots, L-1,\quad r = 0, 1, \ldots, \frac{K}{2}-1$$

Then the $L$-point DFT of $g_r(n)$ is given by

$$\tilde{G}_r(k) = \tilde{X}_{2r}(k) + j\,\tilde{X}_{2r+1}(k) \qquad k = 0, 1, \ldots, L-1,\quad r = 0, 1, \ldots, \frac{K}{2}-1$$

(a) Show that

$$|\tilde{G}_r(k)|^2 + |\tilde{G}_r(L-k)|^2 = 2\left[|\tilde{X}_{2r}(k)|^2 + |\tilde{X}_{2r+1}(k)|^2\right] \qquad k = 0, 1, \ldots, L-1,\quad r = 0, 1, \ldots, \frac{K}{2}-1$$

(b) Determine the resulting expression for $\hat{R}_x^{(PA)}(k)$ in terms of $\tilde{G}_r(k)$.
(c) What changes are necessary if $K$ is an odd number? Provide detailed steps for this case.


5.15 Since $\hat{R}_x^{(PA)}(e^{j\omega})$ is a PSD estimate, one can determine the autocorrelation estimate $\hat{r}_x^{(PA)}(l)$ from
Welch’s method as

$$\hat{r}_x^{(PA)}(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \hat{R}_x^{(PA)}(e^{j\omega})\, e^{j\omega l}\, d\omega \tag{P.4}$$

Let $\tilde{R}_x^{(PA)}(k)$ be the samples of $\hat{R}_x^{(PA)}(e^{j\omega})$ according to

$$\tilde{R}_x^{(PA)}(k) \triangleq \hat{R}_x^{(PA)}(e^{j2\pi k/N_{\text{FFT}}}) \qquad 0 \le k \le N_{\text{FFT}}-1$$

(a) Show that the IDFT $\tilde{r}_x^{(PA)}(l)$ of $\tilde{R}_x^{(PA)}(k)$ is an aliased version of the autocorrelation estimate
$\hat{r}_x^{(PA)}(l)$.
(b) If the length of the overlapping data segment in Welch’s method is $L$, how should $N_{\text{FFT}}$ be
chosen to avoid aliasing in $\tilde{r}_x^{(PA)}(l)$?


5.16 Show that the coherence function $\mathcal{G}_{xy}^2(\omega)$ is invariant under linear transformation, that is, if
$x_1(n) = h_1(n) * x(n)$ and $y_1(n) = h_2(n) * y(n)$, then

$$\mathcal{G}_{xy}^2(\omega) = \mathcal{G}_{x_1 y_1}^2(\omega)$$


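5.17 Bartlett’s method is a special case of Welch’s method in which nonoverlapping sections of length
$L$ are used without windowing in the periodogram averaging operation.
(a) Show that the $i$th periodogram in this method can be expressed as

$$\hat{R}_{x,i}(e^{j\omega}) = \sum_{l=-L}^{L} \check{r}_{x,i}(l)\, w_B(l)\, e^{-j\omega l} \tag{P.5}$$

where $w_B(l)$ is a $(2L-1)$-length Bartlett window.
(b) Let $\mathbf{u}(e^{j\omega}) \triangleq [1 \;\; e^{j\omega} \;\; \cdots \;\; e^{j(L-1)\omega}]^T$. Show that $\hat{R}_{x,i}(e^{j\omega})$ in (P.5) can be expressed
as a quadratic product

$$\hat{R}_{x,i}(e^{j\omega}) = \frac{1}{L}\, \mathbf{u}^H(e^{j\omega})\, \check{\mathbf{R}}_{x,i}\, \mathbf{u}(e^{j\omega}) \tag{P.6}$$

where $\check{\mathbf{R}}_{x,i}$ is the autocorrelation matrix of $\check{r}_{x,i}(l)$ values.
(c) Finally, show that the Bartlett estimate is given by

$$\hat{R}_x^{(B)}(e^{j\omega}) = \frac{1}{KL} \sum_{i=1}^{K} \mathbf{u}^H(e^{j\omega})\, \check{\mathbf{R}}_{x,i}\, \mathbf{u}(e^{j\omega}) \tag{P.7}$$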


5.18 In this problem, we will explore a spectral estimation technique that uses combined data and
correlation weighting (Carter and Nuttall 1980). In this technique, the following steps are performed:

• Given $\{x(n)\}_{n=0}^{N-1}$, compute the Welch-Bartlett estimate $\hat{R}_x^{(PA)}(e^{j\omega})$ by choosing the appropriate values of $L$ and $D$.
• Compute the autocorrelation estimate $\hat{r}_x^{(PA)}(l)$, $-L \le l \le L$, using the approach described in Problem 5.15.
• Window $\hat{r}_x^{(PA)}(l)$, using a lag window $w_a(l)$ to obtain $\hat{r}_x^{(CN)}(l) \triangleq \hat{r}_x^{(PA)}(l)\, w_a(l)$.
• Finally, compute the DTFT of $\hat{r}_x^{(CN)}(l)$ to obtain the new spectrum estimate $\hat{R}_x^{(CN)}(e^{j\omega})$.

(a) Determine the bias of $\hat{R}_x^{(CN)}(e^{j\omega})$.

(b) Comment on the effect of additional windowing on the variance and resolution of the
estimate.

(c) Implement this technique in MATLAB, and compute spectral estimates of the process containing
three sinusoids in white noise, which was discussed in the chapter. Experiment with
various values of $L$ and with different windows. Compare your results to those given for
the Welch-Bartlett and Blackman-Tukey methods.


5.19 Explain why we use the scaling factor

$$\sum_{n=0}^{L-1} w^2(n)$$

which is the energy of the data window in the Welch-Bartlett method.


5.20 Consider the basic periodogram estimator $\hat{R}_x(e^{j\omega})$ at the zero frequency, that is, at $\omega = 0$.

(a) Show that

$$\hat{R}_x(e^{j0}) = \frac{1}{N} \left| \sum_{n=0}^{N-1} x(n)\, e^{-j0} \right|^2 = \frac{1}{N} \left| \sum_{n=0}^{N-1} x(n) \right|^2$$

(b) If $x(n)$ is a real-valued, zero-mean white Gaussian process with variance $\sigma_x^2$, determine the
mean and variance of $\hat{R}_x(e^{j0})$.
(c) Determine if $\hat{R}_x(e^{j0})$ is a consistent estimator by evaluating the variance as $N \to \infty$.


5.21 Consider Bartlett’s method for estimating $R_x(e^{j0})$ using $L = 1$; that is, we use nonoverlapping
segments of single samples. The periodogram of one sample $x(n)$ is simply $|x(n)|^2$. Thus we
have

$$\hat{R}_x^{(B)}(e^{j0}) = \frac{1}{N} \sum_{n=0}^{N-1} \hat{R}_{x,n}(e^{j0}) = \frac{1}{N} \sum_{n=0}^{N-1} |x(n)|^2$$

Again assume that $x(n)$ is a real-valued white Gaussian process with variance $\sigma_x^2$.

(a) Determine the mean and variance of $\hat{R}_x^{(B)}(e^{j0})$.
(b) Compare the above result with those in Problem 5.20. Comment on any differences.


5.22 One desirable property of lag or correlation windows is that their Fourier transforms are
nonnegative.


(a) Formulate a procedure to generate a symmetric lag window of length 2L + 1 with nonneg- 
ative Fourier transform. 

(b) Using the Hanning window as a prototype in the above procedure, determine and plot a 
31-length lag window. Also plot its Fourier transform. 


5.23 Consider the following random process

$$x(n) = \sum_{k=1}^{4} A_k \sin(\omega_k n + \phi_k) + v(n)$$

where

$$A_1 = 1 \qquad A_2 = 0.5 \qquad A_3 = 0.5 \qquad A_4 = 0.25$$
$$\omega_1 = 0.1\pi \qquad \omega_2 = 0.6\pi \qquad \omega_3 = 0.65\pi \qquad \omega_4 = 0.8\pi$$

and the phases $\{\phi_i\}_{i=1}^{4}$ are IID random variables uniformly distributed over $[-\pi, \pi]$. Generate
50 realizations of $x(n)$ for $0 \le n \le 256$. Let $v(n)$ be WN(0, 1).


(a) Compute the Blackman-Tukey estimates for L = 32, 64, and 128, using the Bartlett lag 
window. Plot your results, using overlay and averaged estimates. Comment on your plots. 

(b) Repeat part (a), using the Parzen window. 

(c) Provide a qualitative comparison between the above two sets of plots. 


5.24 Consider the random process given in Problem 5.23.


(a) Compute the Bartlett estimate, using L = 16, 32, and 64. Plot your results, using overlay 
and averaged estimates. Comment on your plots. 

(b) Compute the Welch estimate, using 50 percent overlap, Hamming window, and L = 16, 
32, and 64. Plot your results, using overlay and averaged estimates. Comment on your plots. 

(c) Provide a qualitative comparison between the above two sets of plots. 


5.25 Consider the random process given in Problem 5.23.


(a) Compute the multitaper spectrum estimate, using K = 3, 5, and 7 Slepian tapers. Plot your 
results, using overlay and averaged estimates. Comment on your plots. 

(b) Make a qualitative comparison between the above plots and those obtained in Problems 
5.23 and 5.24. 


5.26 Generate 1000 samples of an AR(1) process using $a = -0.9$. Determine its theoretical PSD.


(a) Determine and plot the periodogram of the process along with the true spectrum. Comment 
on the plots. 

(b) Compute the Blackman-Tukey estimates for L = 10, 20, 50, and 100. Plot these estimates 
along with the true spectrum. Comment on your results. 

(c) Compute the Welch estimates for 50 percent overlap, Hamming window, and L = 10, 20, 
50, and 100. Plot these estimates along with the true spectrum. Comment on your results. 


5.27 Generate 1000 samples of an AR(1) process using $a = 0.9$. Determine its theoretical PSD.


(a) Determine and plot the periodogram of the process along with the true spectrum. Comment 
on the plots. 

(b) Compute the Blackman-Tukey estimates for L = 10, 20, 50, and 100. Plot these estimates 
along with the true spectrum. Comment on your results. 

(c) Compute the Welch estimates for 50 percent overlap, Hamming window, and L = 10, 20, 
50, and 100. Plot these estimates along with the true spectrum. Comment on your results. 


5.28 The multitaper estimation technique requires a properly designed orthonormal set of tapers for the
desired performance. One set discussed in the chapter was that of harmonically related sinusoids
given in (5.5.8).

(a) Design a MATLAB function [tapers] = sine_tapers(N,K) that generates $K < N$ sinusoidal
tapers of length $N$.

(b) Using the above function, compute and plot the Fourier transform magnitudes of the first 5 
tapers of length 51. 


5.29 Design a MATLAB function Pxx = psd_sinetaper(x,K) that determines the multitaper
estimates using the sine tapers.


(a) Apply the function psd_sinetaper to the AR(1) process given in Problem 5.26, and 
compare its performance. 

(b) Apply the function psd_sinetaper to the AR(1) process given in Problem 5.27, and 
compare its performance. 


CHAPTER 6 


Optimum Linear Filters 


In this chapter, we present the theory and application of optimum linear filters and predictors. 
We concentrate on linear filters that are optimum in the sense of minimizing the mean square 
error (MSE). The minimum MSE (MMSE) criterion leads to a theory of linear filtering that 
is elegant and simple, involves only second-order statistics, and is useful in many practical 
applications. The optimum filter designed for a given set of second-order moments can be 
used for any realizations of stochastic processes with the same moments. 

We start with the general theory of linear MMSE estimators and their computation, 
using the triangular decomposition of Hermitian positive definite matrices. Then we ap- 
ply the general theory to the design of optimum FIR filters and linear predictors for both 
nonstationary and stationary processes (Wiener filters). We continue with the design of 
nonparametric (impulse response) and parametric (pole-zero) optimum IIR filters and pre- 
dictors for stationary processes. Then we present the design of optimum filters for inverse 
system modeling, blind deconvolution, and their application to equalization of data commu- 
nication channels. We conclude with a concise introduction to optimum matched filters and 
eigenfilters that maximize the output SNR. These signal processing methods find extensive 
applications in digital communication, radar, and sonar systems. 


6.1 OPTIMUM SIGNAL ESTIMATION 


As we discussed in Chapter 1, the solution of many problems of practical interest depends 
on the ability to accurately estimate the value y(n) of a signal (desired response) by using 
a set of values (observations or data) from another related signal or signals. Successful 
estimation is possible if there is significant statistical dependence or correlation between 
the signals involved in the particular application. For example, in the linear prediction 
problem we use the $M$ past samples $x(n-1), x(n-2), \ldots, x(n-M)$ of a signal to
estimate the current sample $x(n)$. The echo canceler in Figure 1.17 uses the transmitted
signal to form a replica of the received echo. The radar signal processor in Figure 1.27 uses
the signals $x_k(n)$ for $1 \le k \le M$ received by the linear antenna array to estimate the value
of the signal $y(n)$ received from the direction of interest. Although the signals in these and
other similar applications have different physical origins, the mathematical formulations of 
the underlying signal processing problems are very similar. 

In array signal processing, the data are obtained by using M different sensors. The 
situation is simpler for filtering applications, because the data are obtained by delaying 
a single discrete-time signal; that is, we have $x_k(n) = x(n+1-k)$, $1 \le k \le M$ (see
Figure 6.1). Further simplifications are possible in linear prediction, where both the desired
response and the data are time samples of the same signal, for example, $y(n) = x(n)$ and
$x_k(n) = x(n-k)$, $1 \le k \le M$. As a result, the design and implementation of optimum
filters and predictors are simpler than those for an optimum array processor.

FIGURE 6.1
Illustration of the data vectors for (a) array processing (multiple
sensors) and (b) FIR filtering or prediction (single sensor) applications.

Since array processing problems are the most general ones, we will formulate and solve
the following estimation problem: Given a set of data $x_k(n)$ for $1 \le k \le M$, determine an
estimate $\hat{y}(n)$ of the desired response $y(n)$, using the rule (estimator)

$$\hat{y}(n) \triangleq \mathcal{H}\{x_k(n),\ 1 \le k \le M\} \tag{6.1.1}$$

which, in general, is a nonlinear function of the data. When $x_k(n) = x(n+1-k)$, the
estimator takes on the form of a discrete-time filter that can be linear or nonlinear, time-invariant
or time-varying, and with a finite- or infinite-duration impulse response. Linear
filters can be implemented using any direct, parallel, cascade, or lattice-ladder structure (see
Section 2.5 and Proakis and Manolakis 1996).

The difference between the estimated response $\hat{y}(n)$ and the desired response $y(n)$,
that is,

$$e(n) \triangleq y(n) - \hat{y}(n) \tag{6.1.2}$$


is known as the error signal. We want to find an estimator whose output approximates the 
desired response as closely as possible according to a certain performance criterion. We use 
the term optimum estimator or optimum signal processor to refer to such an estimator. We 
stress that optimum is not used as a synonym for best; it simply means the best under the given 
set of assumptions and conditions. If either the criterion of performance or the assumptions 
about the statistics of the processed signals change, the corresponding optimum filter will 
change as well. Therefore, an optimum estimator designed for a certain performance metric 
and set of assumptions may perform poorly according to some other criterion or if the actual 
statistics of the processed signals differ from the ones used in the design. For this reason, the 
sensitivity of the performance to deviations from the assumed statistics is very important 
in practical applications of optimum estimators. 


Therefore, the design of an optimum estimator involves the following steps: 


1. Selection of a computational structure with well-defined parameters for the implemen- 
tation of the estimator. 

2. Selection of a criterion of performance or cost function that measures the performance 
of the estimator under some assumptions about the statistical properties of the signals to 
be processed. 

3. Optimization of the performance criterion to determine the parameters of the optimum 
estimator. 

4. Evaluation of the optimum value of the performance criterion to determine whether the 
optimum estimator satisfies the design specifications. 


Many practical applications (e.g., speech, audio, and image coding) require subjective 
criteria that are difficult to express mathematically. Thus, we focus on criteria of performance 
that (1) only depend on the estimation error e(n), (2) provide a sufficient measure of the 
user satisfaction, and (3) lead to a mathematically tractable problem. We generally select a 
criterion of performance by compromising between these objectives. 

Since, in most applications, negative and positive errors are equally harmful, we should 
choose a criterion that weights both negative and positive errors equally. Choices that satisfy 
this requirement include the absolute value of the error $|e(n)|$, or the squared error $|e(n)|^2$,
or some other power of $|e(n)|$ (see Figure 6.2). The emphasis put on different values of
the error is a key factor when we choose a criterion of performance. For example, the 
squared-error criterion emphasizes the effect of large errors much more than the absolute 
error criterion. Thus, the squared-error criterion is more sensitive to outliers (occasional 
large values) than the absolute error criterion is. 
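The difference in outlier sensitivity can be seen in a small numerical sketch. It assumes the simplest possible estimator, a single constant fitted to a handful of made-up observations: the constant that minimizes the sum of squared errors is the sample mean, whereas the constant that minimizes the sum of absolute errors is the sample median. The data values and function names are illustrative only.

```python
from statistics import mean, median

def sse(errors):
    """Sum of squared errors."""
    return sum(e * e for e in errors)

def sae(errors):
    """Sum of absolute errors."""
    return sum(abs(e) for e in errors)

# Hypothetical observations of a constant level; the last one is an outlier.
clean = [1.0, 1.1, 0.9, 1.05, 0.95]
data = clean + [10.0]

# Best constant estimate under each criterion.
c_sq = mean(data)      # minimizes the sum of squared errors
c_abs = median(data)   # minimizes the sum of absolute errors

# How far the single outlier moved each estimate.
shift_sq = abs(c_sq - mean(clean))
shift_abs = abs(c_abs - median(clean))
print(f"squared-error estimate : {c_sq:.3f} (shift {shift_sq:.3f})")
print(f"absolute-error estimate: {c_abs:.3f} (shift {shift_abs:.3f})")
```

In this sketch one outlier moves the squared-error estimate far more than the absolute-error estimate, which is the behavior described above.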


FIGURE 6.2
Graphical illustration of various error-weighting functions (weight versus error value $e$).


To develop a mathematical theory that will help to design and analyze the performance 
of optimum estimators, we assume that the desired response and the data are realizations of 
stochastic processes. Furthermore, although in practice the estimator operates on specific 
realizations of the input and desired response signals, we wish to design an estimator with 
good performance across all members of the ensemble, that is, an estimator that “works
well on average.” Since, at any fixed time $n$, the quantities $y(n)$, $x_k(n)$ for $1 \le k \le M$, and
$e(n)$ are random variables, we should choose a criterion that involves the ensemble or time
averaging of some function of $|e(n)|$. Here is a short list of potential criteria of performance:

1. The mean square error criterion

$$P(n) \triangleq E\{|e(n)|^2\} \tag{6.1.3}$$

which leads, in general, to a nonlinear optimum estimator.

2. The mean $\alpha$th-order error criterion $E\{|e(n)|^\alpha\}$, $\alpha \ne 2$. Using a lower- or higher-order
moment of the absolute error is more appropriate for certain types of non-Gaussian
statistics than the MSE (Stuck 1978).

3. The sum of squared errors (SSE)

$$E(n_i, n_f) \triangleq \sum_{n=n_i}^{n_f} |e(n)|^2 \tag{6.1.4}$$

which, if it is divided by $n_f - n_i + 1$, provides an estimate of the MSE.


The MSE criterion (6.1.3) and the SSE criterion (6.1.4) are the most widely used because 
they (1) are mathematically tractable, (2) lead to the design of useful systems for practical 
applications, and (3) can serve as a yardstick for evaluating estimators designed with other 
criteria (e.g., signal-to-noise ratio, maximum likelihood). In most practical applications, we 
use linear estimators, which further simplifies their design and evaluation. 

Mean square estimation is a rather vast field that was originally developed by Gauss in 
the nineteenth century. The current theories of estimation and optimum filtering started with 
the pioneering work of Wiener and Kolmogorov that was later extended by Kalman, Bucy, 
and others. Some interesting historical reviews are given in Kailath (1974) and Sorenson 
(1970). 


6.2 LINEAR MEAN SQUARE ERROR ESTIMATION 


In this section, we develop the theory of linear MSE estimation. We concentrate on linear es- 
timators for various reasons, including mathematical simplicity and ease of implementation. 
The problem can be stated as follows: 


Design an estimator that provides an estimate $\hat{y}(n)$ of the desired response $y(n)$
using a linear combination of the data $x_k(n)$ for $1 \le k \le M$, such that the MSE
$E\{|y(n) - \hat{y}(n)|^2\}$ is minimized.


More specifically, the linear estimator is defined by

$$\hat{y}(n) \triangleq \sum_{k=1}^{M} c_k^*(n)\, x_k(n) \tag{6.2.1}$$

and the goal is to determine the coefficients $c_k(n)$ for $1 \le k \le M$ such that the MSE
(6.1.3) is minimized. In general, a new set of optimum coefficients should be computed for
each time instant $n$. Since we assume that the desired response and the data are realizations
of stochastic processes, the quantities $y(n), x_1(n), \ldots, x_M(n)$ are random variables at any
fixed time $n$. For convenience, we formulate and solve the estimation problem at a fixed
time instant $n$. Thus, we drop the time index $n$ and restate the problem as follows:


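Estimate a random variable $y$ (the desired response) from a set of related random
variables $x_1, x_2, \ldots, x_M$ (data) using the linear estimator

$$\hat{y} \triangleq \sum_{k=1}^{M} c_k^* x_k = \mathbf{c}^H \mathbf{x} \tag{6.2.2}$$

where

$$\mathbf{x} \triangleq [x_1 \;\; x_2 \;\; \cdots \;\; x_M]^T \tag{6.2.3}$$

is the input data vector and

$$\mathbf{c} \triangleq [c_1 \;\; c_2 \;\; \cdots \;\; c_M]^T \tag{6.2.4}$$

is the parameter or coefficient vector of the estimator.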


Unless otherwise stated, all random variables are assumed to have zero-mean values. The 
number M of data components used is called the order of the estimator. The linear estimator 
(6.2.2) is represented graphically as shown in Figure 6.3 and involves a computational 
structure known as the linear combiner. The MSE

$$P \triangleq E\{|e|^2\} \tag{6.2.5}$$

where

$$e \triangleq y - \hat{y} \tag{6.2.6}$$

is a function of the parameters $c_k$. Minimization of (6.2.5) with respect to parameters $c_k$
leads to a linear estimator $\mathbf{c}_o$ that is optimum in the MSE sense. The parameter vector $\mathbf{c}_o$ is
known as the linear MMSE (LMMSE) estimator and $\hat{y}_o$ as the LMMSE estimate.


FIGURE 6.3
Block diagram representation of the linear estimator (the data $x_1, \ldots, x_M$ are weighted by the
estimator parameters in a linear combiner to form the estimate, which is subtracted from the
desired response to give the error $e$).


6.2.1 Error Performance Surface 


To determine the linear MMSE estimator, we seek the value of the parameter vector c that 
minimizes the function (6.2.5). To this end, we want to express the MSE as a function of 
the parameter vector c and to understand the nature of this dependence. 

By using (6.2.5), (6.2.6), (6.2.2), and the linearity property of the expectation operator, 
the MSE is given by 


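$$\begin{aligned} P(\mathbf{c}) = E\{|e|^2\} &= E\{(y - \mathbf{c}^H\mathbf{x})(y^* - \mathbf{x}^H\mathbf{c})\} \\ &= E\{|y|^2\} - \mathbf{c}^H E\{\mathbf{x}y^*\} - E\{y\mathbf{x}^H\}\mathbf{c} + \mathbf{c}^H E\{\mathbf{x}\mathbf{x}^H\}\mathbf{c} \end{aligned}$$

or more compactly,

$$P(\mathbf{c}) = P_y - \mathbf{c}^H\mathbf{d} - \mathbf{d}^H\mathbf{c} + \mathbf{c}^H\mathbf{R}\mathbf{c} \tag{6.2.7}$$

where

$$P_y \triangleq E\{|y|^2\} \tag{6.2.8}$$

is the power of the desired response,

$$\mathbf{d} \triangleq E\{\mathbf{x}y^*\} \tag{6.2.9}$$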
is the cross-correlation vector between the data vector $\mathbf{x}$ and the desired response $y$, and

$$\mathbf{R} \triangleq E\{\mathbf{x}\mathbf{x}^H\} \tag{6.2.10}$$

is the correlation matrix of the data vector $\mathbf{x}$. The matrix $\mathbf{R}$ is guaranteed to be Hermitian
and nonnegative definite (see Section 3.4.4).

The function $P(\mathbf{c})$ is known as the error performance surface of the estimator. Equation
(6.2.7) shows that the MSE $P(\mathbf{c})$ (1) depends only on the second-order moments of the
desired response and the data and (2) is a quadratic function of the estimator coefficients
and represents an $(M+1)$-dimensional surface with $M$ degrees of freedom. We will see
that if $\mathbf{R}$ is positive definite, then the quadratic function $P(\mathbf{c})$ is bowl-shaped and has a
unique minimum that corresponds to the optimum parameters. The next example illustrates
this fact for the second-order case.

EXAMPLE 6.2.1. If $M = 2$ and the random variables $y$, $x_1$, and $x_2$ are real-valued, the MSE is

$$P(c_1, c_2) = P_y - 2d_1c_1 - 2d_2c_2 + r_{11}c_1^2 + 2r_{12}c_1c_2 + r_{22}c_2^2$$

because $r_{12} = r_{21}$. Thus $P(c_1, c_2)$ is a second-order function of the coefficients $c_1$ and $c_2$. Figure
6.4 shows two plots of the function $P(c_1, c_2)$ that are quite different in appearance. The surface
in Figure 6.4(a) looks like a bowl and has a unique extremum that is a minimum. The values for
the error surface parameters are $P_y = 0.5$, $r_{11} = r_{22} = 4.5$, $r_{12} = r_{21} = -0.1545$, $d_1 = -0.5$,
and $d_2 = -0.1545$. On the other hand, in Figure 6.4(b), we have a saddle point that is neither a
minimum nor a maximum (here only the matrix elements have changed to $r_{11} = r_{22} = 1$, $r_{12} = r_{21} = 2$).
If we cut the surfaces with planes parallel to the $(c_1, c_2)$ plane, we obtain contours
of constant MSE that are shown in Figure 6.4(c) and (d). In conclusion, the error performance
surface is bowl-shaped and has a unique minimum only if the matrix $\mathbf{R}$ is positive definite (the
determinants of the two matrices are 20.23 and $-3$, respectively). Only in this case can we
obtain an estimator that minimizes the MSE, and the contours are concentric ellipses whose
center corresponds to the optimum estimator. The bottom of the bowl is determined by setting
the partial derivatives with respect to the unknown parameters to zero, that is,

$$\frac{\partial P(c_1, c_2)}{\partial c_1} = 0 \quad \text{which results in} \quad r_{11}c_1^{(o)} + r_{12}c_2^{(o)} = d_1$$
$$\frac{\partial P(c_1, c_2)}{\partial c_2} = 0 \quad \text{which results in} \quad r_{12}c_1^{(o)} + r_{22}c_2^{(o)} = d_2$$

This is a linear system of two equations with two unknowns whose solution provides the coefficients
$c_1^{(o)}$ and $c_2^{(o)}$ that minimize the MSE function $P(c_1, c_2)$.

FIGURE 6.4
Representative surface and contour plots for positive definite and indefinite quadratic error performance
surfaces.
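The two cases of Example 6.2.1 can be checked numerically. The sketch below solves the two-equation system by Cramer's rule and evaluates the quadratic surface at and around the stationary point; the function names are ours, and the example states $d_1$, $d_2$, and $P_y$ only for case (a), so reusing them for case (b) follows the remark that only the matrix elements change.

```python
def solve_2x2(r11, r12, r22, d1, d2):
    """Solve [[r11, r12], [r12, r22]] [c1, c2]^T = [d1, d2]^T by Cramer's rule."""
    det = r11 * r22 - r12 * r12
    c1 = (r22 * d1 - r12 * d2) / det
    c2 = (r11 * d2 - r12 * d1) / det
    return c1, c2, det

def mse(c1, c2, Py, r11, r12, r22, d1, d2):
    """Quadratic error surface P(c1, c2) for real-valued data (M = 2)."""
    return (Py - 2 * d1 * c1 - 2 * d2 * c2
            + r11 * c1 ** 2 + 2 * r12 * c1 * c2 + r22 * c2 ** 2)

Py, d1, d2 = 0.5, -0.5, -0.1545

# Figure 6.4(a): positive definite R, so the stationary point is a minimum.
ra = (4.5, -0.1545, 4.5)
c1a, c2a, det_a = solve_2x2(*ra, d1, d2)
Pa = mse(c1a, c2a, Py, *ra, d1, d2)

# Figure 6.4(b): negative determinant, so the stationary point is a saddle.
rb = (1.0, 2.0, 1.0)
c1b, c2b, det_b = solve_2x2(*rb, d1, d2)
Pb = mse(c1b, c2b, Py, *rb, d1, d2)

print(f"(a) det = {det_a:.2f}, c = ({c1a:.4f}, {c2a:.4f}), P = {Pa:.4f}")
print(f"(b) det = {det_b:.2f}, c = ({c1b:.4f}, {c2b:.4f}), P = {Pb:.4f}")
```

Moving away from the stationary point in case (a) raises the MSE in every direction, while in case (b) it rises along $(1, 1)$ and falls along $(1, -1)$, which is what makes that point a saddle.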


When the optimum filter is specified by a rational system function, the error perfor- 
mance surface may be nonquadratic. This is illustrated in the following example. 


EXAMPLE 6.2.2. Suppose that we wish to estimate the real-valued output $y(n)$ of the “unknown”
system (see Figure 6.5)

$$G(z) = \frac{0.05 - 0.4z^{-1}}{1 - 1.1314z^{-1} + 0.25z^{-2}}$$

using the pole-zero filter

$$H(z) = \frac{b}{1 - az^{-1}}$$

by minimizing the MSE $E\{e^2(n)\}$ (Johnson and Larimore 1977). The input signal $x(n)$ is white
noise with zero mean and variance $\sigma_x^2$. The MSE is given by

$$E\{e^2(n)\} = E\{[y(n) - \hat{y}(n)]^2\} = E\{y^2(n)\} - 2E\{y(n)\hat{y}(n)\} + E\{\hat{y}^2(n)\}$$

and is a function of parameters $b$ and $a$. Since the impulse response $h(n) = ba^n u(n)$ of the
optimum filter has infinite duration, we cannot use (6.2.7) to compute $E\{e^2(n)\}$ and to plot the
error surface. The three components of $E\{e^2(n)\}$ can be evaluated as follows, using Parseval’s
theorem: The power of the desired response

$$E\{y^2(n)\} = \sigma_x^2 \sum_{n=0}^{\infty} g^2(n) = \frac{\sigma_x^2}{2\pi j} \oint G(z)G(z^{-1})z^{-1}\,dz \triangleq \sigma_x^2\sigma_g^2$$

is constant and can be computed either numerically by using the first $M$ “nonzero” samples
of $g(n)$ or analytically by evaluating the integral using the residue theorem. The power of the
optimum filter output is

$$E\{\hat{y}^2(n)\} = E\{x^2(n)\} \sum_{n=0}^{\infty} h^2(n) = \frac{\sigma_x^2}{2\pi j} \oint H(z)H(z^{-1})z^{-1}\,dz = \sigma_x^2 \frac{b^2}{1 - a^2}$$

and the cross-correlation term is

$$\begin{aligned} E\{y(n)\hat{y}(n)\} &= E\left\{ \sum_{k=0}^{\infty} g(k)x(n-k) \sum_{m=0}^{\infty} h(m)x(n-m) \right\} \\ &= \sigma_x^2 \sum_{k=0}^{\infty} g(k)h(k) = \frac{\sigma_x^2}{2\pi j} \oint G(z)H(z^{-1})z^{-1}\,dz = \sigma_x^2\, b\, G(z)\big|_{z^{-1}=a} \end{aligned}$$

because $E\{x(n-k)x(n-m)\} = \sigma_x^2\delta(m-k)$. For convenience we compute the normalized
MSE

$$P(b, a) \triangleq \frac{E\{e^2(n)\}}{\sigma_x^2\sigma_g^2} = 1 - \frac{2b}{\sigma_g^2}\, G(z)\big|_{z^{-1}=a} + \frac{b^2}{\sigma_g^2}\, \frac{1}{1 - a^2}$$

whose surface and contour plots are shown in Figure 6.6. We note that the resulting error performance
surface is bimodal with a global minimum $P = 0.277$ at $(b, a) = (-0.311, 0.906)$
and a local minimum $P = 0.976$ at $(b, a) = (0.114, -0.519)$. As a result, the determination
of the optimum filter requires the use of nonlinear optimization techniques with all associated
drawbacks.

FIGURE 6.5
Identification of an “unknown” system using an optimum filter.
FIGURE 6.6

Illustration of the nonquadratic form of the error performance surface 
of a pole-zero optimum filter specified by the coefficients of its 
difference equation. 
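The two minima quoted in Example 6.2.2 can be reproduced with a short numerical sketch. It assumes the normalized MSE is the error power divided by the power of the desired response, so that $P(b, a) = 1 - (2b/\sigma_g^2)\,G(z)|_{z^{-1}=a} + (b^2/\sigma_g^2)/(1 - a^2)$ with $\sigma_g^2 = \sum_n g^2(n)$; the truncation length used to approximate $\sigma_g^2$ and the function names are our own choices.

```python
def impulse_response(n_terms):
    """First n_terms samples of g(n) for
    G(z) = (0.05 - 0.4 z^-1) / (1 - 1.1314 z^-1 + 0.25 z^-2)."""
    g = []
    for n in range(n_terms):
        x0 = 1.0 if n == 0 else 0.0   # unit impulse delta(n)
        x1 = 1.0 if n == 1 else 0.0   # delayed impulse delta(n - 1)
        g1 = g[n - 1] if n >= 1 else 0.0
        g2 = g[n - 2] if n >= 2 else 0.0
        g.append(0.05 * x0 - 0.4 * x1 + 1.1314 * g1 - 0.25 * g2)
    return g

def G_at(a):
    """G(z) evaluated at z^-1 = a."""
    return (0.05 - 0.4 * a) / (1.0 - 1.1314 * a + 0.25 * a * a)

# sigma_g^2 = sum of g^2(n); 2000 terms is an assumed truncation length
# (the slowest pole is near 0.83, so the neglected tail is negligible).
sigma_g2 = sum(v * v for v in impulse_response(2000))

def normalized_mse(b, a):
    """Normalized MSE P(b, a) of the first-order pole-zero filter, for |a| < 1."""
    return (1.0 - 2.0 * b * G_at(a) / sigma_g2
            + b * b / (sigma_g2 * (1.0 - a * a)))

P_global = normalized_mse(-0.311, 0.906)
P_local = normalized_mse(0.114, -0.519)
print(f"sigma_g^2 = {sigma_g2:.4f}")
print(f"P at (-0.311,  0.906) = {P_global:.3f}")
print(f"P at ( 0.114, -0.519) = {P_local:.3f}")
```

For a fixed pole $a$ the surface is quadratic in $b$, with minimum at $b = (1 - a^2)\,G(z)|_{z^{-1}=a}$; the nonquadratic, bimodal behavior comes entirely from the dependence on $a$.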


6.2.2 Derivation of the Linear MMSE Estimator 


The approach in Example 6.2.1 can be generalized to obtain the necessary and sufficient
conditions that determine the linear MMSE estimator.† Here, we present a simpler matrix-based
approach that is sufficient for the scope of this chapter.

We first notice that we can put (6.2.7) into the form of a “perfect square” as

$$P(\mathbf{c}) = P_y - \mathbf{d}^H\mathbf{R}^{-1}\mathbf{d} + (\mathbf{R}\mathbf{c} - \mathbf{d})^H\mathbf{R}^{-1}(\mathbf{R}\mathbf{c} - \mathbf{d}) \tag{6.2.11}$$

where only the third term depends on $\mathbf{c}$. If $\mathbf{R}$ is positive definite, the inverse matrix $\mathbf{R}^{-1}$ exists
and is positive definite; that is, $\mathbf{z}^H\mathbf{R}^{-1}\mathbf{z} > 0$ for all $\mathbf{z} \ne \mathbf{0}$. Therefore, if $\mathbf{R}$ is positive definite,
the term $\mathbf{d}^H\mathbf{R}^{-1}\mathbf{d} \ge 0$ decreases the cost function by an amount determined exclusively by
the second-order moments. In contrast, the term $(\mathbf{R}\mathbf{c} - \mathbf{d})^H\mathbf{R}^{-1}(\mathbf{R}\mathbf{c} - \mathbf{d}) \ge 0$ increases the
cost function depending on the choice of the estimator parameters. Thus, the best estimator
is obtained by setting $\mathbf{R}\mathbf{c} - \mathbf{d} = \mathbf{0}$.

†For complex-valued random variables, there are some complications that should be taken into account because
$|e|^2$ is not an analytic function. This topic is discussed in Appendix B.

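Therefore, the necessary and sufficient conditions that determine the linear MMSE
estimator $\mathbf{c}_o$ are

$$\mathbf{R}\mathbf{c}_o = \mathbf{d} \tag{6.2.12}$$

and

$$\mathbf{R} \text{ is positive definite} \tag{6.2.13}$$

In greater detail, (6.2.12) can be written as

$$\begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1M} \\ r_{21} & r_{22} & \cdots & r_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ r_{M1} & r_{M2} & \cdots & r_{MM} \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_M \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_M \end{bmatrix} \tag{6.2.14}$$

where

$$r_{ij} \triangleq E\{x_i x_j^*\} = r_{ji}^* \tag{6.2.15}$$

and

$$d_i \triangleq E\{x_i y^*\} \tag{6.2.16}$$

and are known as the set of normal equations. The invertibility of the correlation matrix
$\mathbf{R}$, and hence the existence of the optimum estimator, is guaranteed if $\mathbf{R}$ is positive
definite. In theory, $\mathbf{R}$ is guaranteed to be nonnegative definite, but in physical applications
it will almost always be positive definite. The normal equations can be solved by using any
general-purpose routine for a set of linear equations.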
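As one possible illustration, the sketch below solves the normal equations through a Cholesky (triangular) factorization, which also acts as a test of positive definiteness. It is restricted to real-valued quantities, the third-order $\mathbf{R}$ and $\mathbf{d}$ are made-up values, and the function names are ours; any other linear-equation solver would serve equally well.

```python
import math

def cholesky(R):
    """Lower-triangular L with R = L L^T, for a real symmetric positive definite R."""
    M = len(R)
    L = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i + 1):
            s = R[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("R is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def solve_normal_equations(R, d):
    """Solve R c = d by forward and backward substitution on the Cholesky factor."""
    M = len(R)
    L = cholesky(R)
    # Forward substitution: L z = d
    z = [0.0] * M
    for i in range(M):
        z[i] = (d[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    # Backward substitution: L^T c = z
    c = [0.0] * M
    for i in reversed(range(M)):
        c[i] = (z[i] - sum(L[k][i] * c[k] for k in range(i + 1, M))) / L[i][i]
    return c

# Illustrative third-order example (values are made up for this sketch).
R = [[4.0, 1.0, 0.5],
     [1.0, 3.0, 1.0],
     [0.5, 1.0, 2.0]]
d = [1.0, 0.5, -0.25]
c_o = solve_normal_equations(R, d)
residual = [sum(R[i][j] * c_o[j] for j in range(3)) - d[i] for i in range(3)]
print("c_o =", [round(v, 4) for v in c_o])
```

The factorization fails with an error when $\mathbf{R}$ is not positive definite, which mirrors condition (6.2.13).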

Using (6.2.11) and (6.2.12), we find that the MMSE $P_o$ is

$$P_o = P_y - \mathbf{d}^H\mathbf{R}^{-1}\mathbf{d} = P_y - \mathbf{d}^H\mathbf{c}_o \tag{6.2.17}$$

where we can easily show that the term $\mathbf{d}^H\mathbf{c}_o$ is equal to $E\{|\hat{y}_o|^2\}$, the power of the optimum
estimate. If $\mathbf{x}$ and $y$ are uncorrelated ($\mathbf{d} = \mathbf{0}$), we have the worst situation ($P_o = P_y$) because
there is no linear estimator that can reduce the MSE. If $\mathbf{d} \ne \mathbf{0}$, there is always going to be
some reduction in the MSE owing to the correlation between the data vector $\mathbf{x}$ and the
desired response $y$, assuming that $\mathbf{R}$ is positive definite. The best situation corresponds to
$\hat{y} = y$, which gives $P_o = 0$. Thus, for comparison purposes, we use the normalized MSE

$$\mathcal{E} \triangleq \frac{P_o}{P_y} = 1 - \frac{P_{\hat{y}_o}}{P_y} \tag{6.2.18}$$

because it is bounded between 0 and 1, that is,

$$0 \le \mathcal{E} \le 1 \tag{6.2.19}$$
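The statement that $\mathbf{d}^H\mathbf{c}_o$ equals the power of the optimum estimate can be worked out in two lines. The derivation below uses only $\hat{y}_o = \mathbf{c}_o^H\mathbf{x}$, the normal equations (6.2.12), and the Hermitian symmetry of $\mathbf{R}$:

```latex
\begin{aligned}
E\{|\hat{y}_o|^2\} &= E\{\mathbf{c}_o^H \mathbf{x}\mathbf{x}^H \mathbf{c}_o\}
                    = \mathbf{c}_o^H \mathbf{R}\, \mathbf{c}_o
                    = \mathbf{c}_o^H \mathbf{d} \\
                   &= (\mathbf{R}^{-1}\mathbf{d})^H \mathbf{d}
                    = \mathbf{d}^H \mathbf{R}^{-1} \mathbf{d}
                    = \mathbf{d}^H \mathbf{c}_o
\end{aligned}
```

Because this quantity is nonnegative and $P_o = P_y - \mathbf{d}^H\mathbf{c}_o$ cannot be negative, the bounds in (6.2.19) follow.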


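If $\tilde{\mathbf{c}}$ is the deviation from the optimum vector $\mathbf{c}_o$, that is, if $\mathbf{c} = \mathbf{c}_o + \tilde{\mathbf{c}}$, then substituting into
(6.2.11) and using (6.2.17), we obtain

$$P(\mathbf{c}_o + \tilde{\mathbf{c}}) = P(\mathbf{c}_o) + \tilde{\mathbf{c}}^H\mathbf{R}\tilde{\mathbf{c}} \tag{6.2.20}$$

Equation (6.2.20) shows that if $\mathbf{R}$ is positive definite, any deviation $\tilde{\mathbf{c}}$ from the optimum
vector $\mathbf{c}_o$ increases the MSE by an amount $\tilde{\mathbf{c}}^H\mathbf{R}\tilde{\mathbf{c}} > 0$, which is known as the excess MSE,
that is,

$$\text{Excess MSE} \triangleq P(\mathbf{c}_o + \tilde{\mathbf{c}}) - P(\mathbf{c}_o) = \tilde{\mathbf{c}}^H\mathbf{R}\tilde{\mathbf{c}} \tag{6.2.21}$$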


We emphasize that the excess MSE depends only on the input correlation matrix and not 
on the desired response. This fact has important implications because any deviation from 
the optimum can be detected by monitoring the MSE. 
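A quick numerical check of this property, restricted to real-valued quantities, is sketched below. It keeps the correlation matrix of Example 6.2.1 fixed, tries two different desired responses (the second pair of $\mathbf{d}$ and $P_y$ is made up), and applies the same deviation from each optimum; the helper names are ours.

```python
def quad_form(R, v):
    """v^T R v for a real symmetric R."""
    n = len(v)
    return sum(v[i] * R[i][j] * v[j] for i in range(n) for j in range(n))

def mse(c, Py, R, d):
    """P(c) = Py - 2 d^T c + c^T R c, the real-valued form of (6.2.7)."""
    return Py - 2.0 * sum(di * ci for di, ci in zip(d, c)) + quad_form(R, c)

def optimum_2x2(R, d):
    """Solve the 2 x 2 normal equations R c = d by Cramer's rule."""
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    return [(R[1][1] * d[0] - R[0][1] * d[1]) / det,
            (R[0][0] * d[1] - R[1][0] * d[0]) / det]

R = [[4.5, -0.1545], [-0.1545, 4.5]]   # correlation matrix of Example 6.2.1
delta = [0.2, -0.1]                    # arbitrary deviation from the optimum

excess = []
# Two different desired responses (d, Py); the second pair is made up.
for d, Py in [([-0.5, -0.1545], 0.5), ([1.0, 2.0], 3.0)]:
    c_o = optimum_2x2(R, d)
    c = [c_o[0] + delta[0], c_o[1] + delta[1]]
    excess.append(mse(c, Py, R, d) - mse(c_o, Py, R, d))

print("excess MSE for each desired response:", excess)
print("delta^T R delta                     :", quad_form(R, delta))
```

Both excess values agree with the quadratic form in the deviation, even though the optimum vectors and the minimum MSE values differ between the two cases.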
For nonzero-mean random variables, we use the estimator $\hat{y} \triangleq c_0 + \mathbf{c}^H\mathbf{x}$. The elements
of $\mathbf{R}$ and $\mathbf{d}$ are replaced by the corresponding covariances and $c_0 = E\{y\} - \mathbf{c}^H E\{\mathbf{x}\}$ (see
Problem 6.1). In the sequel, unless otherwise explicitly stated, we assume that all random
variables have zero mean or have been reduced to zero mean by replacing $y$ by $y - E\{y\}$
and $\mathbf{x}$ by $\mathbf{x} - E\{\mathbf{x}\}$.
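The nonzero-mean case can be illustrated with a scalar sketch ($M = 1$) in which sample averages stand in for the expectations. The synthetic data model, the seed, and the variable names are assumptions made only for this illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption for illustration): y depends linearly on x,
# and both have nonzero means.
N = 5000
x = [3.0 + random.gauss(0.0, 1.0) for _ in range(N)]
y = [2.0 + 0.8 * xi + random.gauss(0.0, 0.5) for xi in x]

mx = sum(x) / N
my = sum(y) / N
# Sample covariances replace the correlations R and d (scalar case, M = 1).
var_x = sum((xi - mx) ** 2 for xi in x) / N
cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / N

c1 = cov_xy / var_x      # solves the one-dimensional normal equation
c0 = my - c1 * mx        # c_0 = E{y} - c E{x}

errors = [yi - (c0 + c1 * xi) for xi, yi in zip(x, y)]
mean_error = sum(errors) / N
print(f"c1 = {c1:.3f}, c0 = {c0:.3f}, mean error = {mean_error:.2e}")
```

The constant term $c_0$ removes the mean of the error, so the remaining problem is the zero-mean one treated in the rest of the section.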


6.2.3 Principal-Component Analysis of the Optimum Linear Estimator 


The properties of optimum linear estimators and their error performance surfaces depend 
on the correlation matrix R. We can learn a lot about the nature of the optimum estimator 
if we express R in terms of its eigenvalues and eigenvectors. Indeed, from Section 3.4.4, 
we have 


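$$\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H = \sum_{i=1}^{M} \lambda_i \mathbf{q}_i\mathbf{q}_i^H \qquad \text{and} \qquad \boldsymbol{\Lambda} = \mathbf{Q}^H\mathbf{R}\mathbf{Q} \tag{6.2.22}$$

where

$$\boldsymbol{\Lambda} \triangleq \operatorname{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_M\} \tag{6.2.23}$$

contains the eigenvalues of $\mathbf{R}$, assumed to be distinct, and

$$\mathbf{Q} \triangleq [\mathbf{q}_1 \;\; \mathbf{q}_2 \;\; \cdots \;\; \mathbf{q}_M] \tag{6.2.24}$$

contains the corresponding eigenvectors of $\mathbf{R}$. The modal matrix $\mathbf{Q}$ is unitary, that is,

$$\mathbf{Q}^H\mathbf{Q} = \mathbf{I} \tag{6.2.25}$$

which implies that $\mathbf{Q}^{-1} = \mathbf{Q}^H$. The relationship (6.2.22) between $\mathbf{R}$ and $\boldsymbol{\Lambda}$ is known as a
similarity transformation.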

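In general, the multiplication of a vector by a matrix changes both the length and the
direction of the vector. We define a coordinate transformation of the optimum parameter
vector by

$$\mathbf{c}'_o \triangleq \mathbf{Q}^H\mathbf{c}_o \quad\text{or}\quad \mathbf{c}_o = \mathbf{Q}\mathbf{c}'_o \tag{6.2.26}$$

Since

$$\|\mathbf{c}_o\|^2 = (\mathbf{Q}\mathbf{c}'_o)^H\mathbf{Q}\mathbf{c}'_o = \mathbf{c}_o'^{H}\mathbf{Q}^H\mathbf{Q}\mathbf{c}'_o = \|\mathbf{c}'_o\|^2 \tag{6.2.27}$$

the transformation (6.2.26) changes the direction of the transformed vector but not its length.
If we substitute (6.2.22) into the normal equations (6.2.12), we obtain

$$\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H\mathbf{c}_o = \mathbf{d} \quad\text{or}\quad \boldsymbol{\Lambda}\mathbf{Q}^H\mathbf{c}_o = \mathbf{Q}^H\mathbf{d}$$

which results in

$$\boldsymbol{\Lambda}\mathbf{c}'_o = \mathbf{d}' \tag{6.2.28}$$

where

$$\mathbf{d}' \triangleq \mathbf{Q}^H\mathbf{d} \quad\text{or}\quad \mathbf{d} = \mathbf{Q}\mathbf{d}' \tag{6.2.29}$$

is the transformed "decoupled" cross-correlation vector.
Because $\boldsymbol{\Lambda}$ is diagonal, the set of $M$ equations (6.2.28) can be written as

$$\lambda_i c'_{o,i} = d'_i \qquad 1 \le i \le M \tag{6.2.30}$$

where $c'_{o,i}$ and $d'_i$ are the components of $\mathbf{c}'_o$ and $\mathbf{d}'$, respectively. This is an uncoupled set of
$M$ first-order equations. If $\lambda_i \neq 0$, then

$$c'_{o,i} = \frac{d'_i}{\lambda_i} \qquad 1 \le i \le M \tag{6.2.31}$$

and if $\lambda_i = 0$, the value of $c'_{o,i}$ is indeterminate.

The MMSE becomes

$$\begin{aligned}
P_o &= P_y - \mathbf{d}^H\mathbf{c}_o = P_y - (\mathbf{Q}\mathbf{d}')^H\mathbf{Q}\mathbf{c}'_o = P_y - \mathbf{d}'^{H}\mathbf{c}'_o \\
    &= P_y - \sum_{i=1}^{M} d_i'^{*} c'_{o,i} = P_y - \sum_{i=1}^{M} \frac{|d'_i|^2}{\lambda_i}
\end{aligned} \tag{6.2.32}$$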


which shows how the eigenvalues and the decoupled cross-correlations affect the perfor- 
mance of the optimum filter. The advantage of (6.2.31) and (6.2.32) is that we can study the 
behavior of each parameter of the optimum estimator independently of all the remaining 
ones. 
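As a small illustration of (6.2.31), (6.2.32), and (6.2.35), the sketch below works through a real-valued second-order case. The matrix `R`, the vector `d`, and `Py` are illustrative assumptions; the eigenvalues and eigenvectors of this particular `R` are known in closed form and are simply written down rather than computed.

```python
import math

# Hypothetical second-order moments (real-valued case).
R = [[2.0, 1.0], [1.0, 2.0]]
d = [1.0, 0.5]
Py = 1.0

# Eigen-decomposition of this particular R: eigenvalues 3 and 1 with
# orthonormal eigenvectors [1, 1]/sqrt(2) and [1, -1]/sqrt(2).
lam = [3.0, 1.0]
s = 1.0 / math.sqrt(2.0)
q = [[s, s], [s, -s]]          # q[i] is the i-th eigenvector

# Decoupled cross-correlations d'_i = q_i^T d, cf. (6.2.29).
d_prime = [q[i][0] * d[0] + q[i][1] * d[1] for i in range(2)]

# Each transformed coefficient is found independently, cf. (6.2.31).
c_prime = [d_prime[i] / lam[i] for i in range(2)]

# Back to the original coordinates: c_o = sum_i c'_i q_i, cf. (6.2.35).
c_o = [sum(c_prime[i] * q[i][r] for i in range(2)) for r in range(2)]

# MMSE from the eigenvalues and decoupled cross-correlations, cf. (6.2.32).
P_o = Py - sum(d_prime[i] ** 2 / lam[i] for i in range(2))
print(c_o, P_o)
```

For these assumed values the result satisfies the normal equations directly: $\mathbf{R}\mathbf{c}_o = \mathbf{d}$ holds for $\mathbf{c}_o = [0.5 \;\; 0]^T$, and $P_o = P_y - \mathbf{d}^T\mathbf{c}_o = 0.5$.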

To appreciate the significance of the principal-component transformation, we will discuss
the error surface of a second-order estimator. However, all the results can be easily
generalized to estimators of order $M$, whose error performance surface exists in a space of
$M + 1$ dimensions. Figure 6.7 shows the contours of constant MSE for a positive definite,
second-order error surface. The contours are concentric ellipses centered at the tip of the
optimum vector $\mathbf{c}_o$. We define a new coordinate system with origin at $\mathbf{c}_o$ and axes determined
by the major axis $\tilde{v}_1$ and the minor axis $\tilde{v}_2$ of the ellipses. The two axes are orthogonal, and
the resulting system is known as the principal coordinate system. The transformation from
the "old" system to the "new" system is done in two steps:

$$\begin{aligned}
\text{Translation:}\quad & \tilde{\mathbf{c}} = \mathbf{c} - \mathbf{c}_o \\
\text{Rotation:}\quad & \tilde{\mathbf{v}} = \mathbf{Q}^H\tilde{\mathbf{c}}
\end{aligned} \tag{6.2.33}$$

where the rotation changes the axes of the space to match the axes of the ellipsoid. The
excess MSE (6.2.21) becomes

$$\Delta P(\tilde{\mathbf{v}}) = \tilde{\mathbf{c}}^H\mathbf{R}\tilde{\mathbf{c}} = \tilde{\mathbf{c}}^H\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H\tilde{\mathbf{c}} = \tilde{\mathbf{v}}^H\boldsymbol{\Lambda}\tilde{\mathbf{v}} = \sum_{i=1}^{M} \lambda_i |\tilde{v}_i|^2 \tag{6.2.34}$$

which shows that the penalty paid for the deviation of a parameter from its optimum value
is proportional to the corresponding eigenvalue. Clearly, changes in uncoupled parameters
(which correspond to $\lambda_i = 0$) do not affect the excess MSE.

Using (6.2.22), we have

$$\mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d} = \mathbf{Q}\boldsymbol{\Lambda}^{-1}\mathbf{Q}^H\mathbf{d} = \sum_{i=1}^{M} \frac{\mathbf{q}_i^H\mathbf{d}}{\lambda_i}\,\mathbf{q}_i = \sum_{i=1}^{M} \frac{d'_i}{\lambda_i}\,\mathbf{q}_i \tag{6.2.35}$$

and the optimum estimate can be written as

$$\hat{y}_o = \mathbf{c}_o^H\mathbf{x} = \sum_{i=1}^{M} c_{o,i}'^{*}\,(\mathbf{q}_i^H\mathbf{x}) \tag{6.2.36}$$

which leads to the representation of the optimum estimator shown in Figure 6.8. The
eigenfilters $\mathbf{q}_i$ decorrelate the data vector $\mathbf{x}$ into its principal components, which are weighted
and added to produce the optimum estimate.

FIGURE 6.7
Contours of constant MSE and principal-component axes for a second-order quadratic error
surface.


FIGURE 6.8
Principal-components representation of the optimum linear estimator.


6.2.4 Geometric Interpretations and the Principle of Orthogonality 


It is convenient and pedagogic to think of random variables with zero mean value and finite 
variance as vectors in an abstract vector space with an inner product (i.e., a Hilbert space) 
defined by their correlation

$$\langle x, y \rangle \triangleq E\{xy^*\} \tag{6.2.37}$$

and the length of a vector by

$$\|x\|^2 \triangleq \langle x, x \rangle = E\{|x|^2\} < \infty \tag{6.2.38}$$

From the definition of the correlation coefficient in Section 3.2.1 and the above definitions,
we obtain

$$|\langle x, y \rangle|^2 \le \|x\|^2\,\|y\|^2 \tag{6.2.39}$$

which is known as the Cauchy-Schwarz inequality. Two random variables are orthogonal,
denoted by $x \perp y$, if

$$\langle x, y \rangle = E\{xy^*\} = 0 \tag{6.2.40}$$

which implies they are uncorrelated since they have zero mean.

This geometric viewpoint offers an illuminating and intuitive interpretation for many 
aspects of MSE estimation that we will find very useful. Indeed, using (6.2.9), (6.2.10), and 
(6.2.12), we have 


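$$E\{\mathbf{x}e_o^*\} = E\{\mathbf{x}(y^* - \mathbf{x}^H\mathbf{c}_o)\} = E\{\mathbf{x}y^*\} - E\{\mathbf{x}\mathbf{x}^H\}\mathbf{c}_o = \mathbf{d} - \mathbf{R}\mathbf{c}_o = \mathbf{0}$$

Therefore

$$E\{\mathbf{x}e_o^*\} = \mathbf{0} \tag{6.2.41}$$

or

$$E\{x_m e_o^*\} = 0 \qquad \text{for } 1 \le m \le M \tag{6.2.42}$$

that is, the estimation error is orthogonal to the data used for the estimation. Equations
(6.2.41), or equivalently (6.2.42), are known as the orthogonality principle and are widely
used in linear MMSE estimation.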

To illustrate the use of the orthogonality principle, we note that any linear combination
$c_1^*x_1 + \cdots + c_M^*x_M$ lies in the subspace defined by the vectors† $x_1, \ldots, x_M$. Therefore,
the estimate $\hat{y}$ that minimizes the squared length of the error vector $e$, that is, the MSE, is
determined by the foot of the perpendicular from the tip of the vector $y$ to the "plane" defined
by vectors $x_1, \ldots, x_M$. This is illustrated in Figure 6.9 for $M = 2$. Since $e_o$ is perpendicular
to every vector in the plane, we have $x_m \perp e_o$, $1 \le m \le M$, which leads to the orthogonality
principle (6.2.42). Conversely, we can start with the orthogonality principle (6.2.41) and
derive the normal equations. This interpretation has led to the name normal equations for
(6.2.12). We will see several times that the concept of orthogonality has many important
theoretical and practical implications. As an illustration, we apply the Pythagorean theorem
to the right triangle formed by vectors $\hat{y}_o$, $e_o$, and $y$, in Figure 6.9, to obtain

$$\|y\|^2 = \|\hat{y}_o\|^2 + \|e_o\|^2 \quad\text{or}\quad E\{|y|^2\} = E\{|\hat{y}_o|^2\} + E\{|e_o|^2\} \tag{6.2.43}$$

which decomposes the power of the desired response into two components, one that is
correlated to the data and one that is uncorrelated to the data.
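The orthogonality principle and the power decomposition (6.2.43) can be observed on a finite data record if sample averages are used in place of the expectations. That substitution is an assumption of this sketch (it turns the MMSE estimator into a least-squares fit), and the data values below are purely illustrative.

```python
# Hypothetical data record; sample averages stand in for the expectations.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 1.0, 0.0]
y = [2.0, 1.0, 4.0, 3.0]
N = len(y)


def corr(a, b):
    """Sample correlation (1/N) * sum_n a(n) b(n) for real-valued data."""
    return sum(p * q for p, q in zip(a, b)) / N


# Sample versions of R and d, followed by the normal equations R c_o = d.
R = [[corr(x1, x1), corr(x1, x2)], [corr(x2, x1), corr(x2, x2)]]
d = [corr(x1, y), corr(x2, y)]
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
c_o = [(d[0] * R[1][1] - R[0][1] * d[1]) / det,
       (R[0][0] * d[1] - R[1][0] * d[0]) / det]

# Optimum estimate and error for every sample in the record.
y_hat = [c_o[0] * a + c_o[1] * b for a, b in zip(x1, x2)]
e_o = [t - u for t, u in zip(y, y_hat)]

# The error is orthogonal to the data and to the estimate ...
print(corr(x1, e_o), corr(x2, e_o), corr(y_hat, e_o))
# ... and the power of y splits as in (6.2.43).
print(corr(y, y), corr(y_hat, y_hat) + corr(e_o, e_o))
```

The first line of output is zero up to rounding, and the two numbers on the second line coincide.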


FIGURE 6.9
Pictorial illustration of the orthogonality principle. For random vectors orthogonality
holds on the "average."


6.2.5 Summary and Further Properties 


We next summarize, for emphasis and future reference, some important properties of opti- 
mum, in the MMSE sense, linear estimators. 


1. Equations (6.2.12) and (6.2.17) show that the optimum estimator and the MMSE depend 
only on the second-order moments of the desired response and the data. The dependence 
on the second-order moments is a consequence of both the linearity of the estimator and 
the use of the MSE criterion. 

2. The error performance surface of the optimum estimator is a quadratic function of its 
coefficients. If the data correlation matrix is positive definite, this function has a unique 
minimum that determines the optimum set of coefficients. The surface can be visualized 
as a bowl, and the optimum estimator corresponds to the bottom of the bowl. 


† We should be careful to avoid confusing vector random variables, that is, vectors whose components are random
variables, and random variables interpreted as vectors in the abstract vector space defined by Equations (6.2.37)
to (6.2.39).

3. If the data correlation matrix $\mathbf{R}$ is positive definite, any deviation from the optimum
increases the MSE according to (6.2.21). The resulting excess MSE depends on $\mathbf{R}$
only. This property is very useful in the design of adaptive filters.

4. When the estimator operates with the optimum set of coefficients, the error $e_o$ is
uncorrelated with (orthogonal to) both the data $x_1, x_2, \ldots, x_M$ and the optimum estimate $\hat{y}_o$. This
property is very useful if we want to monitor the performance of an optimum estimator
in practice and is used also to design adaptive filters.

5. The MMSE, the optimum estimator, and the optimum estimate can be expressed in terms 
of the eigenvalues and eigenvectors of the data correlation matrix. See (6.2.32), (6.2.35), 
and (6.2.36). 

6. The general (unconstrained) estimator

$$\hat{y} \triangleq h(\mathbf{x}) = h(x_1, x_2, \ldots, x_M)$$

that minimizes the MSE

$$P = E\{|y - h(\mathbf{x})|^2\}$$

with respect to $h(\mathbf{x})$ is given by the mean of the conditional density, that is,

$$\hat{y}_o \triangleq h_o(\mathbf{x}) = E\{y \mid \mathbf{x}\} = \int_{-\infty}^{\infty} y\, p_y(y \mid \mathbf{x})\, dy$$

and is, in general, a nonlinear function of $x_1, \ldots, x_M$. If the desired response and the data
are jointly Gaussian, the linear MMSE estimator is the best in the MMSE sense; that
is, we cannot find a nonlinear estimator that produces an estimate with smaller MMSE
(Papoulis 1991).


6.3 SOLUTION OF THE NORMAL EQUATIONS 


In this section, we present a numerical method for the solution of the normal equations
and the computation of the minimum error, using a slight modification of the Cholesky
decomposition of Hermitian positive definite matrices known as the lower-diagonal-upper
decomposition, or $\mathbf{LDL}^H$ decomposition for short.

Hermitian positive definite matrices can be uniquely decomposed into the product of
a lower triangular, a diagonal, and an upper triangular matrix as

$$\mathbf{R} = \mathbf{LDL}^H \tag{6.3.1}$$

where $\mathbf{L}$ is a unit lower-triangular matrix

$$\mathbf{L} \triangleq \begin{bmatrix} 1 & 0 & \cdots & 0 \\ l_{10} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{M-1,0} & l_{M-1,1} & \cdots & 1 \end{bmatrix} \tag{6.3.2}$$

and

$$\mathbf{D} = \mathrm{diag}\{\xi_1, \xi_2, \ldots, \xi_M\} \tag{6.3.3}$$

is a diagonal matrix with strictly real, positive elements. When the decomposition (6.3.1)
is known, we can solve the normal equations

$$\mathbf{R}\mathbf{c}_o = \mathbf{LD}(\mathbf{L}^H\mathbf{c}_o) = \mathbf{d} \tag{6.3.4}$$

by solving the lower triangular system

$$\mathbf{LDk} \triangleq \mathbf{d} \tag{6.3.5}$$

for the intermediate vector $\mathbf{k}$ and the upper triangular system

$$\mathbf{L}^H\mathbf{c}_o = \mathbf{k} \tag{6.3.6}$$

for the optimum estimator $\mathbf{c}_o$. The advantage is that triangular systems of equations are
solved by simple forward or backward substitution.

We next provide a constructive proof of the $\mathbf{LDL}^H$ decomposition by example and
illustrate its application to the solution of the normal equations for $M = 4$. The generalization
to an arbitrary order is straightforward and is given in Section 7.1.4.


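EXAMPLE 6.3.1. Writing the decomposition (6.3.1) explicitly for $M = 4$, we have

$$\begin{bmatrix} r_{11} & r_{12} & r_{13} & r_{14} \\ r_{21} & r_{22} & r_{23} & r_{24} \\ r_{31} & r_{32} & r_{33} & r_{34} \\ r_{41} & r_{42} & r_{43} & r_{44} \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 & 0 \\ l_{10} & 1 & 0 & 0 \\ l_{20} & l_{21} & 1 & 0 \\ l_{30} & l_{31} & l_{32} & 1 \end{bmatrix}
\begin{bmatrix} \xi_1 & 0 & 0 & 0 \\ 0 & \xi_2 & 0 & 0 \\ 0 & 0 & \xi_3 & 0 \\ 0 & 0 & 0 & \xi_4 \end{bmatrix}
\begin{bmatrix} 1 & l_{10}^* & l_{20}^* & l_{30}^* \\ 0 & 1 & l_{21}^* & l_{31}^* \\ 0 & 0 & 1 & l_{32}^* \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{6.3.7}$$

where $r_{ij} = r_{ji}^*$ and $\xi_i > 0$, by assumption. If we perform the matrix multiplications on the
right-hand side of (6.3.7) and equate the matrix elements on the left and right sides, we obtain

$$\begin{aligned}
r_{11} = \xi_1 \;&\Rightarrow\; \xi_1 = r_{11} \\
r_{21} = \xi_1 l_{10} \;&\Rightarrow\; l_{10} = \frac{r_{21}}{\xi_1} \\
r_{22} = \xi_1 |l_{10}|^2 + \xi_2 \;&\Rightarrow\; \xi_2 = r_{22} - \xi_1 |l_{10}|^2 \\
r_{31} = \xi_1 l_{20} \;&\Rightarrow\; l_{20} = \frac{r_{31}}{\xi_1} \\
r_{32} = \xi_1 l_{20} l_{10}^* + \xi_2 l_{21} \;&\Rightarrow\; l_{21} = \frac{r_{32} - \xi_1 l_{20} l_{10}^*}{\xi_2} \\
r_{33} = \xi_1 |l_{20}|^2 + \xi_2 |l_{21}|^2 + \xi_3 \;&\Rightarrow\; \xi_3 = r_{33} - \xi_1 |l_{20}|^2 - \xi_2 |l_{21}|^2 \\
r_{41} = \xi_1 l_{30} \;&\Rightarrow\; l_{30} = \frac{r_{41}}{\xi_1} \\
r_{42} = \xi_1 l_{30} l_{10}^* + \xi_2 l_{31} \;&\Rightarrow\; l_{31} = \frac{r_{42} - \xi_1 l_{30} l_{10}^*}{\xi_2} \\
r_{43} = \xi_1 l_{30} l_{20}^* + \xi_2 l_{31} l_{21}^* + \xi_3 l_{32} \;&\Rightarrow\; l_{32} = \frac{r_{43} - \xi_1 l_{30} l_{20}^* - \xi_2 l_{31} l_{21}^*}{\xi_3} \\
r_{44} = \xi_1 |l_{30}|^2 + \xi_2 |l_{31}|^2 + \xi_3 |l_{32}|^2 + \xi_4 \;&\Rightarrow\; \xi_4 = r_{44} - \xi_1 |l_{30}|^2 - \xi_2 |l_{31}|^2 - \xi_3 |l_{32}|^2
\end{aligned} \tag{6.3.8}$$

which provides a row-by-row computation of the elements of the $\mathbf{LDL}^H$ decomposition. We
note that the computation of the next row does not change the already computed rows.
The lower unit triangular system in (6.3.5) becomes

$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ l_{10} & 1 & 0 & 0 \\ l_{20} & l_{21} & 1 & 0 \\ l_{30} & l_{31} & l_{32} & 1 \end{bmatrix}
\begin{bmatrix} \xi_1 k_1 \\ \xi_2 k_2 \\ \xi_3 k_3 \\ \xi_4 k_4 \end{bmatrix}
= \begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{bmatrix} \tag{6.3.9}$$

and can be solved by forward substitution, starting with the first equation. Indeed, we obtain

$$\begin{aligned}
\xi_1 k_1 = d_1 \;&\Rightarrow\; k_1 = \frac{d_1}{\xi_1} \\
l_{10}\xi_1 k_1 + \xi_2 k_2 = d_2 \;&\Rightarrow\; k_2 = \frac{d_2 - l_{10}\xi_1 k_1}{\xi_2} \\
l_{20}\xi_1 k_1 + l_{21}\xi_2 k_2 + \xi_3 k_3 = d_3 \;&\Rightarrow\; k_3 = \frac{d_3 - l_{20}\xi_1 k_1 - l_{21}\xi_2 k_2}{\xi_3} \\
l_{30}\xi_1 k_1 + l_{31}\xi_2 k_2 + l_{32}\xi_3 k_3 + \xi_4 k_4 = d_4 \;&\Rightarrow\; k_4 = \frac{d_4 - l_{30}\xi_1 k_1 - l_{31}\xi_2 k_2 - l_{32}\xi_3 k_3}{\xi_4}
\end{aligned} \tag{6.3.10}$$

which compute the coefficients $k_i$ in "forward" order. Then, the optimum estimator is obtained
by solving the upper unit triangular system in (6.3.6) by backward substitution, starting from the
last equation. Indeed, we have

$$\begin{bmatrix} 1 & l_{10}^* & l_{20}^* & l_{30}^* \\ 0 & 1 & l_{21}^* & l_{31}^* \\ 0 & 0 & 1 & l_{32}^* \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} c_1^{(4)} \\ c_2^{(4)} \\ c_3^{(4)} \\ c_4^{(4)} \end{bmatrix}
= \begin{bmatrix} k_1 \\ k_2 \\ k_3 \\ k_4 \end{bmatrix}
\;\Rightarrow\;
\begin{aligned}
c_4^{(4)} &= k_4 \\
c_3^{(4)} &= k_3 - l_{32}^* c_4^{(4)} \\
c_2^{(4)} &= k_2 - l_{21}^* c_3^{(4)} - l_{31}^* c_4^{(4)} \\
c_1^{(4)} &= k_1 - l_{10}^* c_2^{(4)} - l_{20}^* c_3^{(4)} - l_{30}^* c_4^{(4)}
\end{aligned} \tag{6.3.11}$$

that is, the coefficients of the optimum estimator are computed in "backward" order. As a result of
this backward substitution, computing one more coefficient for the optimum estimator changes
all the previously computed coefficients. Indeed, the coefficients of the third-order estimator are

$$\begin{bmatrix} 1 & l_{10}^* & l_{20}^* \\ 0 & 1 & l_{21}^* \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} c_1^{(3)} \\ c_2^{(3)} \\ c_3^{(3)} \end{bmatrix}
= \begin{bmatrix} k_1 \\ k_2 \\ k_3 \end{bmatrix}
\;\Rightarrow\;
\begin{aligned}
c_3^{(3)} &= k_3 \\
c_2^{(3)} &= k_2 - l_{21}^* c_3^{(3)} \\
c_1^{(3)} &= k_1 - l_{10}^* c_2^{(3)} - l_{20}^* c_3^{(3)}
\end{aligned} \tag{6.3.12}$$

which are different from the first three coefficients of the fourth-order estimator.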


Careful inspection of the formulas for $r_{11}$, $r_{22}$, $r_{33}$, and $r_{44}$ shows that the diagonal
elements of $\mathbf{R}$ provide an upper bound for the elements of $\mathbf{L}$ and $\mathbf{D}$, which is the reason for
the good numerical properties of the $\mathbf{LDL}^H$ decomposition algorithm. The general formulas
for the row-by-row computation of the triangular decomposition, forward substitution, and
backward substitution are given in Table 6.1 and can be easily derived by generalizing the
results of the previous example. The triangular decomposition requires $M^3/6$ operations,
and the solution of each triangular system requires $M(M + 1)/2 \approx M^2/2$ operations.


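TABLE 6.1
Solution of normal equations using triangular decomposition.

Triangular decomposition. For $i = 1, 2, \ldots, M$ and for $j = 0, 1, \ldots, i-1$,

$$\xi_i = r_{ii} - \sum_{m=1}^{i-1} \xi_m |l_{i-1,m-1}|^2$$

$$l_{ij} = \frac{1}{\xi_{j+1}} \left( r_{i+1,j+1} - \sum_{m=0}^{j-1} \xi_{m+1}\, l_{im}\, l_{jm}^* \right) \qquad (\text{not executed when } i = M)$$

Forward substitution. For $i = 1, 2, \ldots, M$,

$$k_i = \frac{1}{\xi_i} \left( d_i - \sum_{m=0}^{i-2} l_{i-1,m}\, \xi_{m+1}\, k_{m+1} \right)$$

Backward substitution. For $i = M, M-1, \ldots, 1$,

$$c_i = k_i - \sum_{m=i+1}^{M} l_{m-1,i-1}^*\, c_m$$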
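The three loops of Table 6.1 translate almost line by line into code. The following is a minimal sketch, not the book's `ldlt` or `lduneqs` routines: the function names, the zero-based indexing, and the small complex test matrix are assumptions made for illustration.

```python
def ldlt(R):
    """Row-by-row LDL^H factorization of a Hermitian positive definite matrix.

    Returns the unit lower-triangular factor L (as a list of rows) and the
    diagonal elements xi of D. Indices are zero-based, so xi[i] here plays
    the role of xi_{i+1} in the table.
    """
    M = len(R)
    L = [[1.0 if i == j else 0.0 for j in range(M)] for i in range(M)]
    xi = [0.0] * M
    for i in range(M):
        for j in range(i):
            s = sum(xi[m] * L[i][m] * L[j][m].conjugate() for m in range(j))
            L[i][j] = (R[i][j] - s) / xi[j]
        xi[i] = (R[i][i] - sum(xi[m] * abs(L[i][m]) ** 2 for m in range(i))).real
        if xi[i] <= 0.0:
            raise ValueError("R is not positive definite")
    return L, xi


def solve_normal_equations(R, d):
    """Solve R c = d via L D k = d (forward) and L^H c = k (backward)."""
    L, xi = ldlt(R)
    M = len(d)
    k = [0.0] * M
    for i in range(M):                      # forward substitution
        k[i] = (d[i] - sum(L[i][m] * xi[m] * k[m] for m in range(i))) / xi[i]
    c = [0.0] * M
    for i in reversed(range(M)):            # backward substitution
        c[i] = k[i] - sum(L[m][i].conjugate() * c[m] for m in range(i + 1, M))
    return c, k, xi


# Hypothetical Hermitian positive definite matrix and right-hand side.
R = [[2.0, 1j, 0.0], [-1j, 2.0, 1j], [0.0, -1j, 2.0]]
d = [1.0, 1j, 0.0]
c_o, k, xi = solve_normal_equations(R, d)
residual = max(abs(sum(R[i][j] * c_o[j] for j in range(3)) - d[i])
               for i in range(3))
print(xi, residual)
```

Only the lower triangle of `R` is read, and the factorization raises an error as soon as a nonpositive $\xi_i$ appears, which is how such a routine doubles as a positive-definiteness test.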


The decomposition (6.3.1) leads to an interesting and practical formula for the
computation of the MMSE without using the optimum estimator coefficients. Indeed, using
(6.2.17), (6.3.6), and (6.3.1), we obtain

$$P_o = P_y - \mathbf{c}_o^H\mathbf{R}\mathbf{c}_o = P_y - \mathbf{k}^H\mathbf{L}^{-1}\mathbf{R}(\mathbf{L}^{-1})^H\mathbf{k} = P_y - \mathbf{k}^H\mathbf{D}\mathbf{k} \tag{6.3.13}$$

or in scalar form

$$P_o = P_y - \sum_{i=1}^{M} \xi_i |k_i|^2 \tag{6.3.14}$$

since $\mathbf{D}$ is diagonal. Equation (6.3.14) shows that because $\xi_i > 0$, increasing the order
of the filter can only reduce the minimum error and hence leads to a better estimate.
Another important application of (6.3.14) is in the computation of arbitrary positive definite
quadratic forms. Such problems arise in various statistical applications, such as detection
and hypothesis testing, involving the correlation matrix of Gaussian processes (McDonough
and Whalen 1995).

Since the determinant of a unit lower triangular matrix equals 1, from (6.3.1) we obtain

$$\det\mathbf{R} = (\det\mathbf{L})(\det\mathbf{D})(\det\mathbf{L}^H) = \prod_{i=1}^{M} \xi_i \tag{6.3.15}$$

which shows that if $\mathbf{R}$ is positive definite, $\xi_i > 0$ for all $i$, and vice versa.

The triangular decomposition of symmetric, positive definite matrices is numerically
stable. The function [L,D]=ldlt(R) implements the first part of the algorithm in Table
6.1, and it fails only if matrix R is not positive definite. Therefore, it can be used as
an efficient test to find out whether a symmetric matrix is positive definite. The function
[co,Po]=lduneqs(L,D,d) computes the MMSE estimator using the last formula in Table
6.1 and the corresponding MMSE using (6.3.14).

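To summarize, linear MMSE estimation involves the following computational steps:

1. $\mathbf{R} = E\{\mathbf{x}\mathbf{x}^H\}$, $\mathbf{d} = E\{\mathbf{x}y^*\}$      Normal equations $\mathbf{R}\mathbf{c}_o = \mathbf{d}$
2. $\mathbf{R} = \mathbf{LDL}^H$      Triangular decomposition
3. $\mathbf{LDk} = \mathbf{d}$      Forward substitution $\rightarrow \mathbf{k}$
4. $\mathbf{L}^H\mathbf{c}_o = \mathbf{k}$      Backward substitution $\rightarrow \mathbf{c}_o$
5. $P_o = P_y - \mathbf{k}^H\mathbf{D}\mathbf{k}$      MMSE computation
6. $e_o = y - \mathbf{c}_o^H\mathbf{x}$      Computation of residuals
(6.3.16)

The vector $\mathbf{k}$ can also be obtained using the $\mathbf{LDL}^H$ decomposition of an augmented
correlation matrix. To this end, consider the augmented vector

$$\bar{\mathbf{x}} = \begin{bmatrix} \mathbf{x} \\ y \end{bmatrix} \tag{6.3.17}$$

and its correlation matrix

$$\bar{\mathbf{R}} = E\{\bar{\mathbf{x}}\bar{\mathbf{x}}^H\} = \begin{bmatrix} E\{\mathbf{x}\mathbf{x}^H\} & E\{\mathbf{x}y^*\} \\ E\{y\mathbf{x}^H\} & E\{|y|^2\} \end{bmatrix} = \begin{bmatrix} \mathbf{R} & \mathbf{d} \\ \mathbf{d}^H & P_y \end{bmatrix} \tag{6.3.18}$$

We can easily show that the $\mathbf{LDL}^H$ decomposition of $\bar{\mathbf{R}}$ is

$$\bar{\mathbf{R}} = \begin{bmatrix} \mathbf{L} & \mathbf{0} \\ \mathbf{k}^H & 1 \end{bmatrix} \begin{bmatrix} \mathbf{D} & \mathbf{0} \\ \mathbf{0}^H & P_o \end{bmatrix} \begin{bmatrix} \mathbf{L}^H & \mathbf{k} \\ \mathbf{0}^H & 1 \end{bmatrix} \tag{6.3.19}$$

which provides the MMSE $P_o$ and the quantities $\mathbf{L}$ and $\mathbf{k}$ required to obtain the optimum
estimator $\mathbf{c}_o$ by solving $\mathbf{L}^H\mathbf{c}_o = \mathbf{k}$.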


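EXAMPLE 6.3.2. Compute, using the $\mathbf{LDL}^H$ method, the optimum estimator and the MMSE
specified by the following second-order moments:

$$\mathbf{R} = \begin{bmatrix} 1 & 3 & 2 & 4 \\ 3 & 12 & 18 & 21 \\ 2 & 18 & 54 & 48 \\ 4 & 21 & 48 & 55 \end{bmatrix} \qquad \mathbf{d} = \begin{bmatrix} 1 \\ 2 \\ 1.5 \\ 4 \end{bmatrix} \qquad\text{and}\qquad P_y = 100$$

Solution. We first compute the triangular factors

$$\mathbf{L} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 2 & 4 & 1 & 0 \\ 4 & 3 & 2 & 1 \end{bmatrix} \qquad \mathbf{D} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 4 \end{bmatrix}$$

using (6.3.8), and the vector $\mathbf{k}$

$$\mathbf{k} = [1 \;\; -\tfrac{1}{3} \;\; 1.75 \;\; -1]^T$$

using (6.3.9). Then we determine the optimum estimator

$$\mathbf{c}_o = [34.5 \;\; -12\tfrac{1}{3} \;\; 3.75 \;\; -1]^T$$

by solving the triangular system (6.3.11). The corresponding MMSE

$$P_o \approx 88.54$$

can be evaluated by using either (6.2.17) or (6.3.14). The reader can easily verify that the $\mathbf{LDL}^H$
decomposition of $\bar{\mathbf{R}}$ provides the elements of $\mathbf{L}$, $\mathbf{k}$, and $P_o$.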
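The verification suggested at the end of the example can be sketched as follows for real-valued data. The augmented matrix of (6.3.18) is factored once, and $\mathbf{L}$, $\mathbf{k}$, and $P_o$ are read off the factors as in (6.3.19). The helper name `ldl_real` is hypothetical, and the third element of $\mathbf{d}$ is taken to be 1.5, the value consistent with $k_3 = 1.75$.

```python
def ldl_real(A):
    """LDL^T factorization of a real symmetric positive definite matrix."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for i in range(n):
        for j in range(i):
            s = sum(D[m] * L[i][m] * L[j][m] for m in range(j))
            L[i][j] = (A[i][j] - s) / D[j]
        D[i] = A[i][i] - sum(D[m] * L[i][m] ** 2 for m in range(i))
    return L, D


R = [[1, 3, 2, 4], [3, 12, 18, 21], [2, 18, 54, 48], [4, 21, 48, 55]]
d = [1, 2, 1.5, 4]      # third element assumed to be 1.5
Py = 100

# Augmented correlation matrix [[R, d], [d^T, Py]] of (6.3.18).
R_aug = [row + [d[i]] for i, row in enumerate(R)] + [d + [Py]]
L_aug, D_aug = ldl_real(R_aug)

L = [row[:4] for row in L_aug[:4]]   # upper-left block: L
D = D_aug[:4]                        # first four pivots: xi_1, ..., xi_4
k = L_aug[4][:4]                     # last row of the factor: k^T
P_o = D_aug[4]                       # last pivot: the MMSE

# Backward substitution L^T c_o = k, cf. (6.3.11).
c_o = [0.0] * 4
for i in reversed(range(4)):
    c_o[i] = k[i] - sum(L[m][i] * c_o[m] for m in range(i + 1, 4))

print(D, k, c_o, P_o)
```

A single factorization of the augmented matrix therefore delivers everything except the final backward substitution.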


Since the diagonal elements $\xi_i$ are positive, the matrix

$$\mathcal{L} \triangleq \mathbf{L}\mathbf{D}^{1/2} \tag{6.3.20}$$

is lower triangular with positive diagonal elements. Then (6.3.1) can be written as

$$\mathbf{R} = \mathcal{L}\mathcal{L}^H \tag{6.3.21}$$

which is known as the Cholesky decomposition of $\mathbf{R}$ (Golub and Van Loan 1996). The
computation of $\mathcal{L}$ requires $M^3/6$ multiplications and additions and $M$ square roots and can
be done by using the function L=chol(R)'. The function [L,D]=ldltchol(R) computes
the $\mathbf{LDL}^H$ decomposition using the function chol.
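The relation (6.3.20) between the two factorizations can be illustrated with a short sketch for real symmetric matrices. It computes the Cholesky factor directly and then recovers $\mathbf{L}$ and $\mathbf{D}$ from it, which mirrors the idea behind `ldltchol` but is not its actual implementation; the function name `cholesky` is an assumption, and the matrix is the one from Example 6.3.2.

```python
import math


def cholesky(R):
    """Lower-triangular Cholesky factor C of a real symmetric PD matrix."""
    n = len(R)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(C[i][m] * C[j][m] for m in range(j))
            if i == j:
                C[i][i] = math.sqrt(R[i][i] - s)
            else:
                C[i][j] = (R[i][j] - s) / C[j][j]
    return C


R = [[1, 3, 2, 4], [3, 12, 18, 21], [2, 18, 54, 48], [4, 21, 48, 55]]
C = cholesky(R)

# Since C = L D^{1/2}, the pivots are the squared diagonal of C and
# L follows by scaling each column of C by its diagonal element.
xi = [C[i][i] ** 2 for i in range(4)]
L = [[C[i][j] / C[j][j] for j in range(4)] for i in range(4)]
print(xi)
print(L)
```

For this matrix the recovered factors agree with those of Example 6.3.2, up to rounding introduced by the square roots.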


6.4 OPTIMUM FINITE IMPULSE RESPONSE FILTERS 


In the previous section, we presented the theory of general linear MMSE estimators [see 
Figure 6.1(a)]. In this section, we apply these results to the design of optimum linear filters, 
that is, filters whose performance is the best possible when measured according to the MMSE 
criterion [see Figure 6.1(b)]. The general formulation of the optimum filtering problem is 
shown in Figure 6.10. The optimum filter forms an estimate $\hat{y}(n)$ of the desired response
y(n) by using samples from a related input signal x(n). The theory of optimum filters 
was developed by Wiener (1942) in continuous time and Kolmogorov (1939) in discrete 
time. Levinson (1947) reformulated the theory for FIR filters and stationary processes and 
developed an efficient algorithm for the solution of the normal equations that exploits the 
Toeplitz structure of the autocorrelation matrix R (see Section 7.4). For this reason, linear 
MMSE filters are often referred to as Wiener filters. 


FIGURE 6.10
Block diagram representation of the optimum filtering problem.

We consider a linear FIR filter specified by its impulse response $h(n, k)$. The output of
the filter is determined by the superposition summation

$$\hat{y}(n) = \sum_{k=0}^{M-1} h(n, k)\,x(n - k) \tag{6.4.1}$$

$$\hat{y}(n) \triangleq \sum_{k=1}^{M} c_k^*(n)\,x(n - k + 1) \triangleq \mathbf{c}^H(n)\mathbf{x}(n) \tag{6.4.2}$$

where

$$\mathbf{c}(n) \triangleq [c_1(n) \;\; c_2(n) \;\; \cdots \;\; c_M(n)]^T \tag{6.4.3}$$

and

$$\mathbf{x}(n) \triangleq [x(n) \;\; x(n-1) \;\; \cdots \;\; x(n-M+1)]^T \tag{6.4.4}$$

are the filter coefficient vector† and the input data vector, respectively. Equation (6.4.1)
becomes a convolution if $h(n, k)$ does not depend on $n$, that is, when the filter is
time-invariant. The objective is to find the coefficient vector that minimizes the MSE $E\{|e(n)|^2\}$.

We prefer FIR over IIR filters because (1) any stable IIR filter can be approximated 
to any desirable degree by an FIR filter and (2) optimum FIR filters are easily obtained by 
solving a linear system of equations. 


6.4.1 Design and Properties 


To determine the optimum FIR filter $\mathbf{c}_o(n)$, we note that at every time instant $n$, the optimum
filter is the linear MMSE estimator of the desired response $y(n)$ based on the data $\mathbf{x}(n)$.
Since for any fixed $n$ the quantities $y(n), x(n), \ldots, x(n - M + 1)$ are random variables,
we can determine the optimum filter either from (6.2.12) by replacing $\mathbf{x}$ by $\mathbf{x}(n)$, $y$ by $y(n)$,
and $\mathbf{c}_o$ by $\mathbf{c}_o(n)$; or by applying the orthogonality principle (6.2.41). Indeed, using (6.2.41),
(6.1.2), and (6.4.2), we have

$$E\{\mathbf{x}(n)[y^*(n) - \mathbf{x}^H(n)\mathbf{c}_o(n)]\} = \mathbf{0} \tag{6.4.5}$$

which leads to the following set of normal equations

$$\mathbf{R}(n)\mathbf{c}_o(n) = \mathbf{d}(n) \tag{6.4.6}$$

where

$$\mathbf{R}(n) \triangleq E\{\mathbf{x}(n)\mathbf{x}^H(n)\} \tag{6.4.7}$$

is the correlation matrix of the input data vector and

$$\mathbf{d}(n) \triangleq E\{\mathbf{x}(n)y^*(n)\} \tag{6.4.8}$$

is the cross-correlation vector between the desired response and the input data vector, that
is, the input values stored currently in the filter memory and used by the filter to estimate
the desired response. We see that, at every time $n$, the coefficients of the optimum filter are
obtained as the solution of a linear system of equations. The filter $\mathbf{c}_o(n)$ is optimum if and
only if the Hermitian matrix $\mathbf{R}(n)$ is positive definite.


To find the MMSE, we can use either (6.2.17) or the orthogonality principle (6.2.41).
Using the orthogonality principle, we have

$$\begin{aligned}
P_o(n) &= E\{e_o(n)[y^*(n) - \mathbf{x}^H(n)\mathbf{c}_o(n)]\} \\
       &= E\{e_o(n)y^*(n)\} \qquad \text{due to orthogonality} \\
       &= E\{[y(n) - \mathbf{c}_o^H(n)\mathbf{x}(n)]\,y^*(n)\}
\end{aligned}$$

which, because $\mathbf{c}_o^H(n)\mathbf{d}(n) = \mathbf{d}^H(n)\mathbf{R}^{-1}(n)\mathbf{d}(n)$ is real, can be written as

$$P_o(n) = P_y(n) - \mathbf{d}^H(n)\mathbf{c}_o(n) \tag{6.4.9}$$

The first term

$$P_y(n) \triangleq E\{|y(n)|^2\} \tag{6.4.10}$$

is the power of the desired response signal and represents the MSE in the absence of
filtering. The second term $\mathbf{d}^H(n)\mathbf{c}_o(n)$ is the reduction in the MSE that is obtained by using
the optimum filter.

† We define $c_{k+1}(n) \triangleq h^*(n, k)$, $0 \le k \le M - 1$, in order to comply with the definition $\mathbf{R}(n) \triangleq E\{\mathbf{x}(n)\mathbf{x}^H(n)\}$
of the correlation matrix.

In many practical applications, we need to know the performance of the optimum filter 
in terms of MSE reduction prior to computing the coefficients of the filter. Then we can 
decide if it is preferable to (1) use an optimum filter (assuming we can design one), (2) use 
a simpler suboptimum filter with adequate performance, or (3) not use a filter at all. Hence, 
the performance of the optimum filter can serve as a yardstick for other competing methods. 

The optimum filter consists of (1) a linear system solver that determines the optimum set
of coefficients from the normal equations formed, using the known second-order moments,
and (2) a discrete-time filter that computes the estimate $\hat{y}(n)$ (see Figure 6.11). The solution
of (6.4.6) can be obtained by using standard linear system solution techniques. In MATLAB,
we solve (6.4.6) by copt=R\d and compute the MMSE by Popt=Py-d'*copt.
The optimum filter is implemented by yest=filter(conj(copt),1,x), where the conjugate
follows from the definition $c_{k+1} \triangleq h^*(k)$ and is immaterial for real-valued data. We emphasize that the
optimum filter only needs the input signal for its operation, that is, to form the estimate of
$y(n)$; the desired response, if it is available, may be used for other purposes.
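For readers working outside MATLAB, the filtering step can be sketched in a few lines. The function below reproduces what a call of the form filter(b,1,x) does for an FIR filter with zero initial conditions; the name `fir_filter` and the coefficient values are assumptions made for illustration.

```python
def fir_filter(h, x):
    """FIR filtering y(n) = sum_k h(k) x(n - k) with x(n) = 0 for n < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


# Hypothetical optimum coefficients; the impulse response is their conjugate,
# h(k) = conj(c_{k+1}), which changes nothing for real-valued coefficients.
c_opt = [1.0, 0.5]
h = [ck.conjugate() for ck in c_opt]

x = [1.0, 2.0, 3.0]
y_est = fir_filter(h, x)
print(y_est)   # [1.0, 2.5, 4.0]
```

Note that only the input `x` enters the computation of the estimate; the desired response is needed to design the coefficients, not to run the filter.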


FIGURE 6.11
Design and implementation of a time-varying optimum FIR filter.


Conventional frequency-selective filters are designed to shape the spectrum of the
input signal within a specific frequency band in which it operates. In this sense, these
filters are effective only if the components of interest in the input signal have their energy
concentrated within nonoverlapping bands. To design the filters, we need to know the limits
of these bands, not the values of the sequences to be filtered; that is, such filters are not
data-adaptive. In contrast, optimum filters are designed using the second-order moments
of the processed signals and have the same effect on all classes of signals with the same
second-order moments. Optimum filters are effective even if the signals of interest have
overlapping spectra. Although the actual data values do not affect optimum filters either, that
is, they are also not data-adaptive, these filters are optimized to the statistics of the data and
thus provide superior performance when judged by the statistical criterion.

The dependence of the optimum filter only on the second-order moments is a conse- 
quence of the linearity of the filter and the use of the MSE criterion. Phase information about 
the input signal or non-second-order moments of the input and desired response processes is 
not needed; even if the moments are known, they are not used by the filter. Such information 
is useful only if we employ a nonlinear filter or use another criterion of performance. 

The error performance surface of the optimum direct-form FIR filter is a quadratic 
function of its impulse response. If the input correlation matrix is positive definite, this 
function has a unique minimum that determines the optimum set of coefficients. The surface 
can be visualized as a bowl, and the optimum filter corresponds to the bottom of the bowl. 
The bottom is moving if the processes are nonstationary and fixed if they are stationary. In 
general, the shape of the error performance surface depends on the criterion of performance 
and the structure of the filter. Note that the use of another criterion of performance or another 
filter structure may lead to error performance surfaces with multiple local minima or saddle 
points. 


6.4.2 Optimum FIR Filters for Stationary Processes 


Further simplifications and additional insight into the operation of optimum linear filters 
are possible when the input and desired response stochastic processes are jointly wide-sense 
stationary. In this case, the correlation matrix of the input data and the cross-correlation 
vector do not depend on the time index n. Therefore, the optimum filter and the MMSE are 
time-invariant (i.e., they are independent of the time index n) and are determined by 


$$\mathbf{R}\mathbf{c}_o = \mathbf{d} \tag{6.4.11}$$

and

$$P_o = P_y - \mathbf{d}^H\mathbf{c}_o \tag{6.4.12}$$

Owing to stationarity, the autocorrelation matrix is

$$\mathbf{R} = \begin{bmatrix} r_x(0) & r_x(1) & \cdots & r_x(M-1) \\ r_x^*(1) & r_x(0) & \cdots & r_x(M-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_x^*(M-1) & r_x^*(M-2) & \cdots & r_x(0) \end{bmatrix} \tag{6.4.13}$$

determined by the autocorrelation $r_x(l) = E\{x(n)x^*(n - l)\}$ of the input signal. The
cross-correlation vector between the desired response and the input data vector is

$$\mathbf{d} \triangleq [d_1 \;\; d_2 \;\; \cdots \;\; d_M]^T \triangleq [r_{yx}^*(0) \;\; r_{yx}^*(1) \;\; \cdots \;\; r_{yx}^*(M-1)]^T \tag{6.4.14}$$

and $P_y$ is the power of the desired response. For stationary processes, the matrix $\mathbf{R}$ is Toeplitz
and positive definite unless the components of the data vector are linearly dependent.
Since the optimum filter is time-invariant, it is implemented by using convolution

$$\hat{y}_o(n) = \sum_{k=0}^{M-1} h_o(k)\,x(n - k) \tag{6.4.15}$$

where $h_o(n) = c_{o,n+1}^*$ is the impulse response of the optimum filter.
Using (6.4.13), (6.4.14), $h_o(n) = c_{o,n+1}^*$, and $r_x(l) = r_x^*(-l)$, we can write the normal
equations (6.4.11) more explicitly as

$$\sum_{k=0}^{M-1} h_o(k)\,r_x(m - k) = r_{yx}(m) \qquad 0 \le m \le M - 1 \tag{6.4.16}$$

which is the discrete-time counterpart of the Wiener-Hopf integral equation, and its solution
determines the impulse response of the optimum filter. We notice that the cross-correlation
between the input signal and the desired response (right-hand side) is equal to the
convolution between the autocorrelation of the input signal and the optimum filter (left-hand side).
Thus, to obtain the optimum filter, we need to solve a convolution equation.
The MMSE is given by

$$P_o = P_y - \sum_{k=0}^{M-1} h_o(k)\,r_{yx}^*(k) \tag{6.4.17}$$

which is obtained by substituting (6.4.14) into (6.4.12). Table 6.2 summarizes the
information required for the design of an optimum (in the MMSE sense) linear time-invariant filter,
the Wiener-Hopf equations that define the filter, and the resulting MMSE.


TABLE 6.2
Specification of optimum linear filters for stationary signals. The limits 0
and $M - 1$ on the summations can be replaced by any values $M_1$ and $M_2$.

Filter and Error Definitions:   $e(n) \triangleq y(n) - \sum_{k=0}^{M-1} h(k)\,x(n - k)$
Criterion of Performance:       $P \triangleq E\{|e(n)|^2\} \rightarrow$ minimum
Wiener-Hopf Equations:          $\sum_{k=0}^{M-1} h_o(k)\,r_x(m - k) = r_{yx}(m)$, $0 \le m \le M - 1$
Minimum MSE:                    $P_o = P_y - \sum_{k=0}^{M-1} h_o(k)\,r_{yx}^*(k)$
Second-Order Statistics:        $r_x(l) = E\{x(n)x^*(n - l)\}$, $P_y = E\{|y(n)|^2\}$,
                                $r_{yx}(l) = E\{y(n)x^*(n - l)\}$


To summarize, for nonstationary processes $\mathbf{R}(n)$ is Hermitian and nonnegative definite,
and the optimum filter $h_o(n)$ is time-varying. For stationary processes, $\mathbf{R}$ is Hermitian,
nonnegative definite, and Toeplitz, and the optimum filter is time-invariant. A Toeplitz
autocorrelation matrix is positive definite if the power spectrum of the input satisfies $R_x(e^{j\omega}) > 0$
for all frequencies $\omega$. In both cases, the filter is used for all realizations of the processes. If
$M = \infty$, we have a causal IIR optimum filter determined by an infinite-order linear system
of equations that can only be solved in the stationary case by using analytical techniques
(see Section 6.6).


EXAMPLE 6.4.1. Consider a harmonic random process

$$y(n) = A\cos(\omega_0 n + \phi)$$

with fixed, but unknown, amplitude and frequency, and random phase $\phi$, uniformly distributed
on the interval from 0 to $2\pi$. This process is corrupted by additive white Gaussian noise $v(n) \sim
N(0, \sigma_v^2)$ that is uncorrelated with $y(n)$. The resulting signal $x(n) = y(n) + v(n)$ is available to
the user for processing. Design an optimum FIR filter to remove the corrupting noise $v(n)$ from
the observed signal $x(n)$.

Solution. The input of the optimum filter is $x(n)$, and the desired response is $y(n)$. The signal
$y(n)$ is obviously unavailable, but to design the filter, we only need the second-order moments
$r_x(l)$ and $r_{yx}(l)$. We first note that since $y(n)$ and $v(n)$ are uncorrelated, the autocorrelation of
the input signal is

$$r_x(l) = r_y(l) + r_v(l) = \tfrac{1}{2}A^2\cos\omega_0 l + \sigma_v^2\,\delta(l)$$

where $r_y(l) = \tfrac{1}{2}A^2\cos\omega_0 l$ is the autocorrelation of $y(n)$. The cross-correlation between the
desired response $y(n)$ and the input signal $x(n)$ is

$$r_{yx}(l) = E\{y(n)[y(n - l) + v(n - l)]\} = r_y(l)$$

Therefore, the autocorrelation matrix $\mathbf{R}$ is symmetric Toeplitz and is determined by the elements
$r_x(0), r_x(1), \ldots, r_x(M - 1)$ of its first row. The right-hand side of the Wiener-Hopf equations is
$\mathbf{d} = [r_y(0) \;\; r_y(1) \;\; \cdots \;\; r_y(M - 1)]^T$. If we know $r_y(l)$ and $\sigma_v^2$, we can numerically determine
the optimum filter and the MMSE from (6.4.11) and (6.4.12). For example, suppose that $A =
0.5$, $f_0 = \omega_0/(2\pi) = 0.05$, and $\sigma_v^2 = 0.5$. The input signal-to-noise ratio (SNR) is

$$\mathrm{SNR}_I = 10\log_{10}\frac{A^2/2}{\sigma_v^2} = -6.02 \text{ dB}$$

The processing gain (PG), defined as the ratio of signal-to-noise ratios at the output and input
of a signal processing system

$$\mathrm{PG} \triangleq \frac{\mathrm{SNR}_O}{\mathrm{SNR}_I}$$

provides another useful measure of performance.

The first problem we encounter is how to choose the order $M$ of the filter. In the absence of
any a priori information, we compute $h_o$, the MMSE $P_o$, and the PG for $1 \le M \le M_{\max} = 50$ and
plot the MMSE and the PG in Figure 6.12. We see that a filter of order $M = 20$ provides satisfactory performance.
Figure 6.13 shows a realization of the corrupted and filtered signals. Another useful approach to
evaluate how well the optimum filter enhances a harmonic signal is to compute the spectra of
the input and output signals and the frequency response of the optimum filter. These are shown
in Figure 6.14, where we see that the optimum filter has a sharp bandpass about frequency $f_0$,
as expected (for details see Problem 6.5).

FIGURE 6.12
Plots of (a) the MMSE and (b) the processing gain as a function of the FIR filter order M.
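The design step of the example can be sketched as follows. The code builds the Toeplitz matrix of (6.4.13) from $r_x(l)$, solves the Wiener-Hopf equations, and evaluates the MMSE with (6.4.17). It assumes the parameter values $A = 0.5$, $f_0 = 0.05$, and $\sigma_v^2 = 0.5$; the function names `solve` and `wiener_fir` are hypothetical, and a plain Gaussian elimination stands in for a library solver.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        acc = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - acc) / aug[i][i]
    return x


def wiener_fir(M, A=0.5, f0=0.05, var_v=0.5):
    """Return (h_o, P_o) of the order-M optimum FIR filter for the example."""
    w0 = 2.0 * math.pi * f0
    r_y = [0.5 * A * A * math.cos(w0 * l) for l in range(M)]        # r_y(l)
    r_x = [r_y[l] + (var_v if l == 0 else 0.0) for l in range(M)]   # r_x(l)
    R = [[r_x[abs(i - j)] for j in range(M)] for i in range(M)]     # (6.4.13)
    h = solve(R, r_y)                                               # (6.4.11)
    P_o = r_y[0] - sum(h[k] * r_y[k] for k in range(M))             # (6.4.17)
    return h, P_o


mmse = {M: wiener_fir(M)[1] for M in (1, 5, 10, 20)}
for M, P in mmse.items():
    print(M, round(P, 4), round(P / 0.125, 4))   # MMSE and normalized MSE
```

For these assumed parameters the MMSE drops from 0.1 at $M = 1$ to $1/28 \approx 0.0357$ at $M = 20$, where the filter spans exactly one period of the sinusoid.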


To illustrate the meaning of the estimator's optimality, we will use a Monte Carlo simulation.
Thus, we generate $K = 100$ realizations of the sequence $x(\zeta_i, n)$, $0 \le n \le N - 1$ ($N = 1000$);
we compute the output sequence $\hat{y}(\zeta_i, n)$, using (6.4.15); and then the error sequence $e(\zeta_i, n) =
y(\zeta_i, n) - \hat{y}(\zeta_i, n)$ and its variance $\hat{P}(\zeta_i)$. Figure 6.15 shows a plot of $\hat{P}(\zeta_i)$, $1 \le i \le K$. We
notice that although the filter performs better or worse than the optimum in particular cases, on
average its performance is close to the theoretically predicted one. This is exactly the meaning
of the MMSE criterion: optimum performance on the average (in the MMSE sense).

FIGURE 6.13
Example of the noise-corrupted and filtered sinusoidal signals.

FIGURE 6.14
PSD of the input signal, magnitude response of the optimum filter, and PSD of the
output signal.

FIGURE 6.15
Results of Monte Carlo simulation of the optimum filter. The
solid line corresponds to the MMSE and the dashed line to the
average of $\hat{P}(\zeta_i)$ values.


For a certain realization, the optimum filter may not perform as well as some other 
linear filters; however, on average, it performs better than any other linear filter of the same 
order when all possible realizations of x(n) and y(n) are considered. 


6.4.3 Frequency-Domain Interpretations 
We will now investigate the performance of the optimum filter, for stationary processes, in the frequency domain. Using (6.2.7), (6.4.13), and (6.4.14), we can easily show that the MSE of an FIR filter $h(n)$ is given by

$$P = E\{|e(n)|^2\} = r_y(0) - \sum_{k=0}^{M-1} h(k)\, r_{yx}^*(k) - \sum_{k=0}^{M-1} h^*(k)\, r_{yx}(k) + \sum_{k=0}^{M-1}\sum_{l=0}^{M-1} h(k)\, r_x(l-k)\, h^*(l) \tag{6.4.18}$$

The frequency response function of the FIR filter is

$$H(e^{j\omega}) \triangleq \sum_{k=0}^{M-1} h(k)\, e^{-j\omega k} \tag{6.4.19}$$

Using Parseval’s theorem,

$$\sum_{n=-\infty}^{\infty} x_1(n)\, x_2^*(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} X_1(e^{j\omega})\, X_2^*(e^{j\omega})\, d\omega \tag{6.4.20}$$

we can show that the MSE (6.4.18) can be expressed in the frequency domain as

$$P = r_y(0) - \frac{1}{2\pi}\int_{-\pi}^{\pi} \left[ H(e^{j\omega}) R_{yx}^*(e^{j\omega}) + H^*(e^{j\omega}) R_{yx}(e^{j\omega}) - H(e^{j\omega}) H^*(e^{j\omega}) R_x(e^{j\omega}) \right] d\omega \tag{6.4.21}$$

where $R_x(e^{j\omega})$ is the PSD of $x(n)$ and $R_{yx}(e^{j\omega})$ is the cross-PSD of $y(n)$ and $x(n)$ (see
Problem 6.10). This formula holds for both FIR and IIR filters.


If we minimize (6.4.21) with respect to $H(e^{j\omega})$, we obtain the system function of the
optimum filter and the MMSE. However, we leave this for Problem 6.11 and instead express
(6.4.17) in the frequency domain by using (6.4.20). Indeed, we have

$$P_o = r_y(0) - \frac{1}{2\pi}\int_{-\pi}^{\pi} H_o(e^{j\omega})\, R_{yx}^*(e^{j\omega})\, d\omega = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left[ R_y(e^{j\omega}) - H_o(e^{j\omega})\, R_{yx}^*(e^{j\omega}) \right] d\omega \tag{6.4.22}$$

where $H_o(e^{j\omega})$ is the frequency response of the optimum filter. The above equation holds
for any filter, FIR or IIR, as long as we use the proper limits to compute the summation in
(6.4.19).

We will now obtain a formula for the MMSE that holds only for IIR filters whose
impulse response extends from $-\infty$ to $\infty$. In this case, (6.4.16) is a convolution equation
that holds for $-\infty < m < \infty$. Using the convolution theorem of the Fourier transform, we
obtain

$$H_o(e^{j\omega}) = \frac{R_{yx}(e^{j\omega})}{R_x(e^{j\omega})} \tag{6.4.23}$$

which, we again stress, holds for noncausal IIR filters only. Substituting into (6.4.22), we
obtain

$$P_o = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left[ 1 - \frac{|R_{yx}(e^{j\omega})|^2}{R_y(e^{j\omega})\, R_x(e^{j\omega})} \right] R_y(e^{j\omega})\, d\omega$$

or

$$P_o = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left[ 1 - \mathcal{G}_{yx}(e^{j\omega}) \right] R_y(e^{j\omega})\, d\omega \tag{6.4.24}$$

where $\mathcal{G}_{yx}(e^{j\omega})$ is the coherence function between $x(n)$ and $y(n)$.

This important equation indicates that the performance of the optimum filter depends 
on the coherence between the input and desired response processes. As we recall from 
Section 5.4, the coherence is a measure of both the noise disturbing the observations and 
the relative linearity between x(n) and y(n). The optimum filter can reduce the MMSE 
at a certain band only if there is significant coherence, that is, $\mathcal{G}_{yx}(e^{j\omega}) \simeq 1$. Thus, the
optimum filter $H_o(z)$ constitutes the best, in the MMSE sense, linear relationship between
the stochastic processes x(n) and y(n). These interpretations apply to causal IIR and FIR 
optimum filters, even if (6.4.23) and (6.4.24) only hold approximately in these cases (see 
Section 6.6). 
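
As a worked illustration of (6.4.24), the following sketch specializes the coherence to the additive-noise model $x(n) = y(n) + v(n)$ with $y(n)$ and $v(n)$ uncorrelated. This model is an assumption made here for illustration (it is the setting treated in Section 6.6.3), and $R_v(e^{j\omega})$ denotes the PSD of the noise.

```latex
\begin{aligned}
R_{yx}(e^{j\omega}) &= R_y(e^{j\omega}), \qquad
R_x(e^{j\omega}) = R_y(e^{j\omega}) + R_v(e^{j\omega}) \\[4pt]
\mathcal{G}_{yx}(e^{j\omega})
  &= \frac{|R_{yx}(e^{j\omega})|^2}{R_y(e^{j\omega})\,R_x(e^{j\omega})}
   = \frac{R_y(e^{j\omega})}{R_y(e^{j\omega}) + R_v(e^{j\omega})} \\[4pt]
P_o &= \frac{1}{2\pi}\int_{-\pi}^{\pi}
       \bigl[1 - \mathcal{G}_{yx}(e^{j\omega})\bigr] R_y(e^{j\omega})\,d\omega
     = \frac{1}{2\pi}\int_{-\pi}^{\pi}
       \frac{R_y(e^{j\omega})\,R_v(e^{j\omega})}{R_y(e^{j\omega}) + R_v(e^{j\omega})}\,d\omega
\end{aligned}
```

Under this assumption the coherence approaches 1 in bands where the signal dominates the noise and approaches 0 where the noise dominates, so the integrand is small wherever the SNR is high.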


6.5 LINEAR PREDICTION 


Linear prediction plays a prominent role in many theoretical, computational, and practical 
areas of signal processing and deals with the problem of estimating or predicting the value 
$x(n)$ of a signal at the time instant $n = n_0$, by using a set of other samples from the
same signal. Although linear prediction is a subject useful in itself, its importance in signal 
processing is also due, as we will see later, to its use in the development of fast algorithms 
for optimum filtering and its relation to all-pole signal modeling. 


6.5.1 Linear Signal Estimation 
Suppose that we are given a set of values $x(n), x(n-1), \ldots, x(n-M)$ of a stochastic
process and we wish to estimate the value of $x(n-i)$, using a linear combination of the
remaining samples. The resulting estimate and the corresponding estimation error are given
by

$$\hat{x}(n-i) \triangleq -\sum_{\substack{k=0 \\ k \ne i}}^{M} c_k^{(i)*}(n)\, x(n-k) \tag{6.5.1}$$

and

$$e^{(i)}(n) \triangleq x(n-i) - \hat{x}(n-i) = \sum_{k=0}^{M} c_k^{(i)*}(n)\, x(n-k) \quad \text{with } c_i^{(i)}(n) \triangleq 1 \tag{6.5.2}$$

where $c_k^{(i)}(n)$ are the coefficients of the estimator as a function of discrete-time index $n$. The
process is illustrated in Figure 6.16.


Linear signal estimation 
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FIGURE 6.16 

Illustration showing the samples, estimates, and errors used in 
linear signal estimation, forward linear prediction, and 
backward linear prediction. 


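To determine the MMSE signal estimator, we partition (6.5.2) as

$$e^{(i)}(n) = \sum_{k=0}^{i-1} c_k^{(i)*}(n)\, x(n-k) + x(n-i) + \sum_{k=i+1}^{M} c_k^{(i)*}(n)\, x(n-k)$$
$$\triangleq \mathbf{c}_1^{(i)H}(n)\, \mathbf{x}_1(n) + x(n-i) + \mathbf{c}_2^{(i)H}(n)\, \mathbf{x}_2(n) \triangleq [\bar{\mathbf{c}}^{(i)}(n)]^H \bar{\mathbf{x}}(n) \tag{6.5.3}$$

where the partitions of the coefficient and data vectors, around the ith component, are easily
defined from the context. To obtain the normal equations and the MMSE for the optimum
linear signal estimator, we note that

$$\text{Desired response} = x(n-i) \qquad \text{data vector} = \begin{bmatrix} \mathbf{x}_1(n) \\ \mathbf{x}_2(n) \end{bmatrix}$$

Using (6.4.6) and (6.4.9) or the orthogonality principle, we have

$$\begin{bmatrix} \mathbf{R}_{11}(n) & \mathbf{R}_{12}(n) \\ \mathbf{R}_{12}^H(n) & \mathbf{R}_{22}(n) \end{bmatrix} \begin{bmatrix} \mathbf{c}_1^{(i)}(n) \\ \mathbf{c}_2^{(i)}(n) \end{bmatrix} = -\begin{bmatrix} \mathbf{r}_1(n) \\ \mathbf{r}_2(n) \end{bmatrix} \tag{6.5.4}$$

or more compactly†

$$\mathbf{R}^{(i)}(n)\, \mathbf{c}_o^{(i)}(n) = -\mathbf{d}^{(i)}(n) \tag{6.5.5}$$

and

$$P_o^{(i)}(n) = P_x(n-i) + \mathbf{r}_1^H(n)\, \mathbf{c}_1^{(i)}(n) + \mathbf{r}_2^H(n)\, \mathbf{c}_2^{(i)}(n) \tag{6.5.6}$$

where for $j, k = 1, 2$

$$\mathbf{R}_{jk}(n) \triangleq E\{\mathbf{x}_j(n)\, \mathbf{x}_k^H(n)\} \tag{6.5.7}$$
$$\mathbf{r}_j(n) \triangleq E\{\mathbf{x}_j(n)\, x^*(n-i)\} \tag{6.5.8}$$
$$P_x(n) = E\{|x(n)|^2\} \tag{6.5.9}$$

For various reasons, to be seen later, we will combine (6.5.4) and (6.5.6) into a single
equation. To this end, we note that the correlation matrix of the extended vector

$$\bar{\mathbf{x}}(n) = \begin{bmatrix} \mathbf{x}_1(n) \\ x(n-i) \\ \mathbf{x}_2(n) \end{bmatrix} \tag{6.5.10}$$

can be partitioned as

$$\bar{\mathbf{R}}(n) = E\{\bar{\mathbf{x}}(n)\, \bar{\mathbf{x}}^H(n)\} = \begin{bmatrix} \mathbf{R}_{11}(n) & \mathbf{r}_1(n) & \mathbf{R}_{12}(n) \\ \mathbf{r}_1^H(n) & P_x(n-i) & \mathbf{r}_2^H(n) \\ \mathbf{R}_{12}^H(n) & \mathbf{r}_2(n) & \mathbf{R}_{22}(n) \end{bmatrix} \tag{6.5.11}$$

with respect to its ith row and ith column. Using (6.5.4), (6.5.6), and (6.5.11), we obtain

$$\bar{\mathbf{R}}(n)\, \bar{\mathbf{c}}_o^{(i)}(n) = \begin{bmatrix} \mathbf{0} \\ P_o^{(i)}(n) \\ \mathbf{0} \end{bmatrix} \leftarrow i\text{th row} \tag{6.5.12}$$

which completely determines the linear signal estimator $\bar{\mathbf{c}}_o^{(i)}(n)$ and the MMSE $P_o^{(i)}(n)$.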

If $M = 2L$ and $i = L$, we have a symmetric linear smoother $\bar{\mathbf{c}}(n)$ that produces an
estimate of the middle sample by using the $L$ past and the $L$ future samples. The above
formulation suggests an easy procedure for the computation of the linear signal estimator
for any value of $i$, which is outlined in Table 6.3 and implemented by the function
[ci,Pi]=olsigest(R,i). We next discuss two types of linear signal estimation that
are of special interest and have their own dedicated notation.
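
The procedure of Table 6.3 can be sketched in a few lines of Python. This is a minimal illustration, not the book's olsigest routine: the names `solve` and `signal_estimator` are hypothetical, the data are assumed real-valued (so that the Hermitian transpose reduces to a transpose), and the test matrix is an assumed Toeplitz example with $r(0) = 5/4$, $r(1) = 1/2$, $r(2) = 0$. Exact fractions are used so that the results can be compared digit for digit.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    # Build the augmented matrix [A | b]
    M = [[Fraction(v) for v in row] + [Fraction(b[r])] for r, row in enumerate(A)]
    for col in range(n):
        # Bring a row with a nonzero pivot into position
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def signal_estimator(Rbar, i):
    """Steps 2 to 5 of Table 6.3 for a real (M+1) x (M+1) matrix Rbar.

    Returns the extended coefficient vector (with 1 in position i) and the MMSE.
    """
    n = len(Rbar)
    keep = [k for k in range(n) if k != i]
    R_i = [[Rbar[r][c] for c in keep] for r in keep]   # step 2: drop row/column i
    dbar = [Rbar[r][i] for r in range(n)]              # ith column of Rbar
    d_i = [dbar[r] for r in keep]                      # step 3: drop its ith element
    c = solve(R_i, [-v for v in d_i])                  # step 4: R_i c = -d_i
    cbar = c[:i] + [Fraction(1)] + c[i:]
    P = sum(dv * cv for dv, cv in zip(dbar, cbar))     # step 5 (real data)
    return cbar, P


# Assumed example: Toeplitz matrix with r(0) = 5/4, r(1) = 1/2, r(2) = 0
r0, r1, r2 = Fraction(5, 4), Fraction(1, 2), Fraction(0)
Rbar = [[r0, r1, r2], [r1, r0, r1], [r2, r1, r0]]
for i in range(3):
    cbar, P = signal_estimator(Rbar, i)
    print(i, [float(v) for v in cbar], float(P))
```

Choosing i = 0 gives a forward predictor, i = M a backward predictor, and the middle index a symmetric smoother, which are the cases discussed in the following sections.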


6.5.2 Forward Linear Prediction 


One-step forward linear prediction (FLP) involves the estimation or prediction of the value
$x(n)$ of a stochastic process by using a linear combination of the past samples
$x(n-1), \ldots, x(n-M)$ (see Figure 6.16). We should stress that in signal processing applications
of linear prediction, what is important is the ability to obtain a good estimate of a sample,
pretending that it is unknown, instead of forecasting the future. Thus, the term prediction
is used more with signal estimation than forecasting in mind. The forward predictor is a
linear signal estimator with $i = 0$ and is denoted by

$$e^f(n) \triangleq x(n) + \sum_{k=1}^{M} a_k^*(n)\, x(n-k) = x(n) + \mathbf{a}^H(n)\, \mathbf{x}(n-1) \tag{6.5.13}$$

where

$$\mathbf{a}(n) \triangleq [a_1(n)\ a_2(n)\ \cdots\ a_M(n)]^T \tag{6.5.14}$$

is known as the forward linear predictor and $a_k(n)$ with $a_0(n) \triangleq 1$ as the FLP error filter.
To obtain the normal equations and the MMSE for the optimum FLP, we note that for $i = 0$,
(6.5.11) can be written as

$$\bar{\mathbf{R}}(n) = \begin{bmatrix} P_x(n) & \mathbf{r}^{fH}(n) \\ \mathbf{r}^f(n) & \mathbf{R}(n-1) \end{bmatrix} \tag{6.5.15}$$

where

$$\mathbf{R}(n) \triangleq E\{\mathbf{x}(n)\, \mathbf{x}^H(n)\} \tag{6.5.16}$$

and

$$\mathbf{r}^f(n) \triangleq E\{\mathbf{x}(n-1)\, x^*(n)\} \tag{6.5.17}$$

Therefore, (6.5.5) and (6.5.6) give

$$\mathbf{R}(n-1)\, \mathbf{a}_o(n) = -\mathbf{r}^f(n) \tag{6.5.18}$$

and

$$P_o^f(n) = P_x(n) + \mathbf{r}^{fH}(n)\, \mathbf{a}_o(n) \tag{6.5.19}$$

or

$$\bar{\mathbf{R}}(n) \begin{bmatrix} 1 \\ \mathbf{a}_o(n) \end{bmatrix} = \begin{bmatrix} P_o^f(n) \\ \mathbf{0} \end{bmatrix} \tag{6.5.20}$$

which completely specifies the FLP parameters.

† The minus sign on the right-hand side of the normal equations is the result of arbitrarily setting the coefficient
$c_i^{(i)}(n) \triangleq 1$.

TABLE 6.3
Steps for the computation of optimum signal estimators.

1. Determine the matrix $\bar{\mathbf{R}}(n)$ of the extended data vector $\bar{\mathbf{x}}(n)$.
2. Create the $M \times M$ submatrix $\mathbf{R}^{(i)}(n)$ of $\bar{\mathbf{R}}(n)$ by removing its ith row and its ith column.
3. Create the $M \times 1$ vector $\mathbf{d}^{(i)}(n)$ by extracting the ith column $\bar{\mathbf{d}}^{(i)}(n)$ of $\bar{\mathbf{R}}(n)$ and removing its ith element.
4. Solve the linear system $\mathbf{R}^{(i)}(n)\, \mathbf{c}_o^{(i)}(n) = -\mathbf{d}^{(i)}(n)$ to obtain $\mathbf{c}_o^{(i)}(n)$.
5. Compute the MMSE $P_o^{(i)}(n) = [\bar{\mathbf{d}}^{(i)}(n)]^H\, \bar{\mathbf{c}}_o^{(i)}(n)$.


6.5.3 Backward Linear Prediction 


In this case, we want to estimate the sample x(n — M) in terms of the future samples 
x(n), x(n —1),...,x(n — M + 1) (see Figure 6.16). The term backward linear prediction 
(BLP) is not accurate but is used since it is an established convention. A more appropriate 
name might be postdiction or hindsight. The BLP is basically a linear signal estimator with 
i = M and is denoted by 


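$$e^b(n) \triangleq \sum_{k=0}^{M-1} b_k^*(n)\, x(n-k) + x(n-M) = \mathbf{b}^H(n)\, \mathbf{x}(n) + x(n-M) \tag{6.5.21}$$

where

$$\mathbf{b}(n) \triangleq [b_0(n)\ b_1(n)\ \cdots\ b_{M-1}(n)]^T \tag{6.5.22}$$

is the BLP and $b_k(n)$ with $b_M(n) \triangleq 1$ is the backward prediction error filter (BPEF). For
$i = M$, (6.5.11) gives

$$\bar{\mathbf{R}}(n) = \begin{bmatrix} \mathbf{R}(n) & \mathbf{r}^b(n) \\ \mathbf{r}^{bH}(n) & P_x(n-M) \end{bmatrix} \tag{6.5.23}$$

where

$$\mathbf{r}^b(n) \triangleq E\{\mathbf{x}(n)\, x^*(n-M)\} \tag{6.5.24}$$

The optimum backward linear predictor is specified by

$$\mathbf{R}(n)\, \mathbf{b}_o(n) = -\mathbf{r}^b(n) \tag{6.5.25}$$

and the MMSE is

$$P_o^b(n) = P_x(n-M) + \mathbf{r}^{bH}(n)\, \mathbf{b}_o(n) \tag{6.5.26}$$

and can be put in a single equation as

$$\bar{\mathbf{R}}(n) \begin{bmatrix} \mathbf{b}_o(n) \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ P_o^b(n) \end{bmatrix} \tag{6.5.27}$$

In Table 6.4, we summarize the definitions and design equations for optimum FIR filtering
and prediction. Using the entries in this table, we can easily obtain the normal equations
and the MMSE for the FLP and BLP from those of the optimum filter.

TABLE 6.4
Summary of the design equations for optimum FIR filtering and prediction.

                      | Optimum filter | FLP | BLP
Input data vector     | $\mathbf{x}(n)$ | $\mathbf{x}(n-1)$ | $\mathbf{x}(n)$
Desired response      | $y(n)$ | $x(n)$ | $x(n-M)$
Coefficient vector    | $\mathbf{c}(n)$ | $\mathbf{a}(n)$ | $\mathbf{b}(n)$
Estimation error      | $e(n) = y(n) - \mathbf{c}^H(n)\mathbf{x}(n)$ | $e^f(n) = x(n) + \mathbf{a}^H(n)\mathbf{x}(n-1)$ | $e^b(n) = x(n-M) + \mathbf{b}^H(n)\mathbf{x}(n)$
Normal equations      | $\mathbf{R}(n)\mathbf{c}_o(n) = \mathbf{d}(n)$ | $\mathbf{R}(n-1)\mathbf{a}_o(n) = -\mathbf{r}^f(n)$ | $\mathbf{R}(n)\mathbf{b}_o(n) = -\mathbf{r}^b(n)$
MMSE                  | $P_o(n) = P_y(n) - \mathbf{c}_o^H(n)\mathbf{d}(n)$ | $P_o^f(n) = P_x(n) + \mathbf{a}_o^H(n)\mathbf{r}^f(n)$ | $P_o^b(n) = P_x(n-M) + \mathbf{b}_o^H(n)\mathbf{r}^b(n)$
Required moments      | $\mathbf{R}(n) = E\{\mathbf{x}(n)\mathbf{x}^H(n)\}$, $\mathbf{d}(n) = E\{\mathbf{x}(n)y^*(n)\}$ | $\mathbf{r}^f(n) = E\{\mathbf{x}(n-1)x^*(n)\}$ | $\mathbf{r}^b(n) = E\{\mathbf{x}(n)x^*(n-M)\}$
Stationary processes  | $\mathbf{R}\mathbf{c}_o = \mathbf{d}$, $\mathbf{R}$ is Toeplitz | $\mathbf{R}\mathbf{a}_o = -\mathbf{r}^*$ | $\mathbf{R}\mathbf{b}_o = -\mathbf{J}\mathbf{r} \Rightarrow \mathbf{b}_o = \mathbf{J}\mathbf{a}_o^*$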


6.5.4 Stationary Processes 


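If the process $x(n)$ is stationary, then the correlation matrix $\bar{\mathbf{R}}(n)$ does not depend on the
time $n$ and it is Toeplitz

$$\bar{\mathbf{R}} = \begin{bmatrix} r(0) & r(1) & \cdots & r(M) \\ r^*(1) & r(0) & \cdots & r(M-1) \\ \vdots & \vdots & \ddots & \vdots \\ r^*(M) & r^*(M-1) & \cdots & r(0) \end{bmatrix} \tag{6.5.28}$$

Therefore, all the resulting linear MMSE signal estimators are time-invariant. If we define
the correlation vector

$$\mathbf{r} \triangleq [r(1)\ r(2)\ \cdots\ r(M)]^T \tag{6.5.29}$$

where $r(l) = E\{x(n)\, x^*(n-l)\}$, we can easily see that the cross-correlation vectors for the
FLP and the BLP are

$$\mathbf{r}^f = E\{\mathbf{x}(n-1)\, x^*(n)\} = \mathbf{r}^* \tag{6.5.30}$$

and

$$\mathbf{r}^b = E\{\mathbf{x}(n)\, x^*(n-M)\} = \mathbf{J}\mathbf{r} \tag{6.5.31}$$

where

$$\mathbf{J} = \begin{bmatrix} 0 & \cdots & 0 & 1 \\ \vdots & & 1 & 0 \\ 0 & \cdot^{\,\cdot^{\,\cdot}} & & \vdots \\ 1 & 0 & \cdots & 0 \end{bmatrix} \qquad \mathbf{J}^H\mathbf{J} = \mathbf{J}\mathbf{J}^H = \mathbf{I} \tag{6.5.32}$$

is the exchange matrix that simply reverses the order of the vector elements. Therefore,

$$\mathbf{R}\mathbf{a}_o = -\mathbf{r}^* \tag{6.5.33}$$
$$P_o^f = r(0) + \mathbf{r}^T\mathbf{a}_o \tag{6.5.34}$$
$$\mathbf{R}\mathbf{b}_o = -\mathbf{J}\mathbf{r} \tag{6.5.35}$$
$$P_o^b = r(0) + \mathbf{r}^H\mathbf{J}\mathbf{b}_o \tag{6.5.36}$$

where the Toeplitz matrix $\mathbf{R}$ is obtained from $\bar{\mathbf{R}}$ by deleting the last column and row. Using
the centrosymmetry property of symmetric Toeplitz matrices

$$\mathbf{R}\mathbf{J} = \mathbf{J}\mathbf{R}^* \tag{6.5.37}$$

and (6.5.33), we have

$$\mathbf{J}\mathbf{R}^*\mathbf{a}_o^* = -\mathbf{J}\mathbf{r} \quad \text{or} \quad \mathbf{R}\mathbf{J}\mathbf{a}_o^* = -\mathbf{J}\mathbf{r} \tag{6.5.38}$$

Comparing the last equation with (6.5.35), we have

$$\mathbf{b}_o = \mathbf{J}\mathbf{a}_o^* \tag{6.5.39}$$

that is, the BLP coefficient vector is the reverse of the conjugated FLP coefficient vector.
Furthermore, from (6.5.34), (6.5.36), and (6.5.39), we have

$$P_o \triangleq P_o^f = P_o^b \tag{6.5.40}$$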


that is, the forward and backward prediction error powers are equal. 
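
The symmetry in (6.5.39) and (6.5.40) can be checked numerically. The sketch below is illustrative only: the helper names `solve` and `flp_blp` are hypothetical, and the complex autocorrelation lags are an assumed example built as a sum of two first-order sequences $\rho^l$, which is a valid (positive definite) autocorrelation. It solves (6.5.33) and (6.5.35) independently and compares the results.

```python
import cmath


def solve(A, b):
    """Solve A x = b (complex entries) by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [[complex(v) for v in row] + [complex(b[r])] for r, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def flp_blp(lags):
    """lags = [r(0), ..., r(M)] with r(l) = E{x(n) x*(n-l)}; returns a, b, Pf, Pb."""
    M = len(lags) - 1

    def r(l):
        # Hermitian symmetry: r(-l) = r*(l)
        return lags[l] if l >= 0 else lags[-l].conjugate()

    R = [[r(j - i) for j in range(M)] for i in range(M)]    # M x M Toeplitz matrix
    rv = [r(k) for k in range(1, M + 1)]                      # r = [r(1) ... r(M)]^T
    a = solve(R, [-v.conjugate() for v in rv])                # R a = -r*    (6.5.33)
    b = solve(R, [-v for v in reversed(rv)])                  # R b = -J r   (6.5.35)
    Pf = r(0) + sum(rk * ak for rk, ak in zip(rv, a))         # (6.5.34)
    Pb = r(0) + sum(rk.conjugate() * bk for rk, bk in zip(rv, reversed(b)))  # (6.5.36)
    return a, b, Pf, Pb


# Assumed example lags: r(l) = rho1^l + 0.5 * rho2^l for l = 0, ..., 3
rho1, rho2 = cmath.rect(0.6, 0.7), cmath.rect(0.3, -1.1)
lags = [rho1 ** l + 0.5 * rho2 ** l for l in range(4)]
a, b, Pf, Pb = flp_blp(lags)
print("b    :", b)
print("J a* :", [v.conjugate() for v in reversed(a)])
print("Pf, Pb:", Pf, Pb)
```

The two printed vectors should agree to within rounding, and the two error powers should be equal and real.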

This remarkable symmetry between the MMSE forward and backward linear predictors 
holds for stationary processes but disappears for nonstationary processes. Also, we do not 
have such a symmetry if a criterion other than the MMSE is used and the process to be 
predicted is non-Gaussian (Weiss 1975; Lawrence 1991). 


EXAMPLE 6.5.1. To illustrate the basic ideas in FLP, BLP, and linear smoothing, we consider the 
second-order estimators for stationary processes. 
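The augmented equations for the first-order FLP are ($r(0)$ is always real)

$$\begin{bmatrix} r(0) & r(1) \\ r^*(1) & r(0) \end{bmatrix} \begin{bmatrix} a_0^{(1)} \\ a_1^{(1)} \end{bmatrix} = \begin{bmatrix} P_1^f \\ 0 \end{bmatrix}$$

and they can be solved by using Cramer’s rule. Indeed, with $\mathbf{R}_m$ denoting the $m \times m$ correlation matrix, we obtain

$$a_0^{(1)} = \frac{\det\begin{bmatrix} P_1^f & r(1) \\ 0 & r(0) \end{bmatrix}}{\det \mathbf{R}_2} = \frac{P_1^f\, r(0)}{\det \mathbf{R}_2} = 1 \;\Rightarrow\; P_1^f = \frac{\det \mathbf{R}_2}{\det \mathbf{R}_1} = \frac{r^2(0) - |r(1)|^2}{r(0)}$$

and

$$a_1^{(1)} = \frac{\det\begin{bmatrix} r(0) & P_1^f \\ r^*(1) & 0 \end{bmatrix}}{\det \mathbf{R}_2} = \frac{-P_1^f\, r^*(1)}{\det \mathbf{R}_2} = -\frac{r^*(1)}{r(0)}$$

for the MMSE and the FLP. For the second-order case we have

$$\begin{bmatrix} r(0) & r(1) & r(2) \\ r^*(1) & r(0) & r(1) \\ r^*(2) & r^*(1) & r(0) \end{bmatrix} \begin{bmatrix} a_0^{(2)} \\ a_1^{(2)} \\ a_2^{(2)} \end{bmatrix} = \begin{bmatrix} P_2^f \\ 0 \\ 0 \end{bmatrix}$$

whose solution is

$$a_0^{(2)} = \frac{P_2^f \det \mathbf{R}_2}{\det \mathbf{R}_3} = 1 \;\Rightarrow\; P_2^f = \frac{\det \mathbf{R}_3}{\det \mathbf{R}_2}$$

$$a_1^{(2)} = \frac{-P_2^f \det\begin{bmatrix} r^*(1) & r(1) \\ r^*(2) & r(0) \end{bmatrix}}{\det \mathbf{R}_3} = \frac{-\det\begin{bmatrix} r^*(1) & r(1) \\ r^*(2) & r(0) \end{bmatrix}}{\det \mathbf{R}_2} = \frac{r(1)\, r^*(2) - r(0)\, r^*(1)}{r^2(0) - |r(1)|^2}$$

and

$$a_2^{(2)} = \frac{P_2^f \det\begin{bmatrix} r^*(1) & r(0) \\ r^*(2) & r^*(1) \end{bmatrix}}{\det \mathbf{R}_3} = \frac{\det\begin{bmatrix} r^*(1) & r(0) \\ r^*(2) & r^*(1) \end{bmatrix}}{\det \mathbf{R}_2} = \frac{[r^*(1)]^2 - r(0)\, r^*(2)}{r^2(0) - |r(1)|^2}$$

Similarly, for the BLP

$$\begin{bmatrix} r(0) & r(1) \\ r^*(1) & r(0) \end{bmatrix} \begin{bmatrix} b_0^{(1)} \\ b_1^{(1)} \end{bmatrix} = \begin{bmatrix} 0 \\ P_1^b \end{bmatrix}$$

where $b_1^{(1)} = 1$, we obtain

$$P_1^b = \frac{\det \mathbf{R}_2}{\det \mathbf{R}_1} \qquad b_0^{(1)} = -\frac{r(1)}{r(0)}$$

and for the second-order case

$$\begin{bmatrix} r(0) & r(1) & r(2) \\ r^*(1) & r(0) & r(1) \\ r^*(2) & r^*(1) & r(0) \end{bmatrix} \begin{bmatrix} b_0^{(2)} \\ b_1^{(2)} \\ b_2^{(2)} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ P_2^b \end{bmatrix}$$

where $b_2^{(2)} = 1$, we obtain

$$P_2^b = \frac{\det \mathbf{R}_3}{\det \mathbf{R}_2} \qquad b_1^{(2)} = \frac{r^*(1)\, r(2) - r(0)\, r(1)}{r^2(0) - |r(1)|^2} \qquad b_0^{(2)} = \frac{r^2(1) - r(0)\, r(2)}{r^2(0) - |r(1)|^2}$$

We note that

$$P_1^f = P_1^b \qquad a_1^{(1)} = b_0^{(1)*}$$

and

$$P_2^f = P_2^b \qquad a_1^{(2)} = b_1^{(2)*} \qquad a_2^{(2)} = b_0^{(2)*}$$

which is a result of the stationarity of $x(n)$ or equivalently of the Toeplitz structure of $\mathbf{R}_m$.
For the linear signal estimator, we have

$$\begin{bmatrix} r(0) & r(1) & r(2) \\ r^*(1) & r(0) & r(1) \\ r^*(2) & r^*(1) & r(0) \end{bmatrix} \begin{bmatrix} c_0^{(2)} \\ c_1^{(2)} \\ c_2^{(2)} \end{bmatrix} = \begin{bmatrix} 0 \\ P_2^s \\ 0 \end{bmatrix}$$

with $c_1^{(2)} = 1$. Using Cramer’s rule, and writing $\mathbf{R}_3^{(1)}$ for $\mathbf{R}_3$ with its middle row and column removed, we obtain

$$P_2^s = \frac{\det \mathbf{R}_3}{\det \mathbf{R}_3^{(1)}}$$

$$c_0^{(2)} = \frac{-P_2^s \det\begin{bmatrix} r(1) & r(2) \\ r^*(1) & r(0) \end{bmatrix}}{\det \mathbf{R}_3} = \frac{-\det\begin{bmatrix} r(1) & r(2) \\ r^*(1) & r(0) \end{bmatrix}}{\det \mathbf{R}_3^{(1)}} = \frac{r^*(1)\, r(2) - r(0)\, r(1)}{r^2(0) - |r(2)|^2}$$

$$c_2^{(2)} = \frac{-P_2^s \det\begin{bmatrix} r(0) & r(1) \\ r^*(2) & r^*(1) \end{bmatrix}}{\det \mathbf{R}_3} = \frac{-\det\begin{bmatrix} r(0) & r(1) \\ r^*(2) & r^*(1) \end{bmatrix}}{\det \mathbf{R}_3^{(1)}} = \frac{r(1)\, r^*(2) - r(0)\, r^*(1)}{r^2(0) - |r(2)|^2}$$

from which we see that $c_0^{(2)} = c_2^{(2)*}$; that is, we have a linear phase estimator.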


6.5.5 Properties 


Linear signal estimators and predictors have some interesting properties that we discuss 
next. 


PROPERTY 6.5.1. If the process $x(n)$ is stationary, then the symmetric, linear smoother has linear
phase.

Proof. Using the centrosymmetry property $\bar{\mathbf{R}}\mathbf{J} = \mathbf{J}\bar{\mathbf{R}}^*$ and (6.5.12) for $M = 2L$, $i = L$, we
obtain

$$\bar{\mathbf{c}} = \mathbf{J}\bar{\mathbf{c}}^* \tag{6.5.41}$$

that is, the symmetric, linear smoother has even symmetry and, therefore, has linear phase (see
Problem 6.12).


PROPERTY 6.5.2. If the process $x(n)$ is stationary, the forward prediction error filter (PEF)
$1, a_1, a_2, \ldots, a_M$ is minimum-phase and the backward PEF $b_0, b_1, \ldots, b_{M-1}, 1$ is maximum-phase.

Proof. The system function of the Mth-order forward PEF can be factored as

$$A(z) = 1 + \sum_{k=1}^{M} a_k^* z^{-k} = G(z)(1 - qz^{-1})$$

where $q$ is a zero of $A(z)$ and

$$G(z) = 1 + \sum_{k=1}^{M-1} g_k z^{-k}$$

is an $(M-1)$st-order filter. The filter $A(z)$ can be implemented as the cascade connection of the
filters $G(z)$ and $1 - qz^{-1}$ (see Figure 6.17). The output $s(n)$ of $G(z)$ is

$$s(n) = x(n) + g_1 x(n-1) + \cdots + g_{M-1} x(n-M+1)$$

and it is easy to see that

$$E\{s(n-1)\, e^{f*}(n)\} = 0 \tag{6.5.42}$$

because $E\{x(n-k)\, e^{f*}(n)\} = 0$ for $1 \le k \le M$.

FIGURE 6.17
The prediction error filter with one zero factored out.

Since the output of the second filter can be expressed as

$$e^f(n) = s(n) - q\, s(n-1)$$

we have

$$E\{s(n-1)\, e^{f*}(n)\} = E\{s(n-1)\, s^*(n)\} - q^* E\{s(n-1)\, s^*(n-1)\} = 0$$

which implies that

$$q = \frac{r_s^*(-1)}{r_s(0)} = \frac{r_s(1)}{r_s(0)} \;\Rightarrow\; |q| \le 1$$

because $q$ is equal to the normalized autocorrelation of $s(n)$. If the process $x(n)$ is not predictable,
that is, $E\{|e^f(n)|^2\} \ne 0$, we have

$$E\{|e^f(n)|^2\} = E\{e^f(n)[s^*(n) - q^* s^*(n-1)]\}$$
$$= E\{e^f(n)\, s^*(n)\} \quad \text{due to (6.5.42)}$$
$$= E\{[s(n) - q\, s(n-1)]\, s^*(n)\}$$
$$= r_s(0)(1 - |q|^2) \ne 0$$

which implies that

$$|q| < 1$$

that is, the zero $q$ of the forward PEF filter is strictly inside the unit circle. Repeating this process,
we can show that all zeros of $A(z)$ are inside the unit circle; that is, $A(z)$ is minimum-phase.
This proof was presented in Vaidyanathan et al. (1996). The property $\mathbf{b} = \mathbf{J}\mathbf{a}^*$ is equivalent to

$$B(z) = z^{-M} A^*\!\left(\frac{1}{z^*}\right)$$

which implies that $B(z)$ is a maximum-phase filter (see Section 2.4).


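PROPERTY 6.5.3. The forward and backward prediction error filters can be expressed in terms
of the eigenvalues $\bar{\lambda}_i$ and the eigenvectors $\bar{\mathbf{q}}_i$ of the correlation matrix $\bar{\mathbf{R}}(n)$ as follows

$$\begin{bmatrix} 1 \\ \mathbf{a}_o(n) \end{bmatrix} = P_o^f(n) \sum_{i=1}^{M+1} \frac{1}{\bar{\lambda}_i}\, \bar{\mathbf{q}}_i\, \bar{q}_{i,1}^* \tag{6.5.43}$$

and

$$\begin{bmatrix} \mathbf{b}_o(n) \\ 1 \end{bmatrix} = P_o^b(n) \sum_{i=1}^{M+1} \frac{1}{\bar{\lambda}_i}\, \bar{\mathbf{q}}_i\, \bar{q}_{i,M+1}^* \tag{6.5.44}$$

where $\bar{q}_{i,1}$ and $\bar{q}_{i,M+1}$ are the first and last components of $\bar{\mathbf{q}}_i$. The first equation of (6.5.43) and
the last equation in (6.5.44) can be solved to provide the MMSEs $P_o^f(n)$ and $P_o^b(n)$, respectively.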


Proof. See Problem 6.13. 


PROPERTY 6.5.4. Let $\bar{\mathbf{R}}^{-1}(n)$ be the inverse of the correlation matrix $\bar{\mathbf{R}}(n)$. Then, the inverse of
the ith element of the ith column of $\bar{\mathbf{R}}^{-1}(n)$ is equal to the MMSE $P_o^{(i)}(n)$, and the ith column
normalized by the ith element is equal to $\bar{\mathbf{c}}_o^{(i)}(n)$.


Proof. See Problem 6.14. 
PROPERTY 6.5.5. The MMSE prediction errors can be expressed as

$$P_o^f(n) = \frac{\det \bar{\mathbf{R}}(n)}{\det \mathbf{R}(n-1)} \qquad P_o^b(n) = \frac{\det \bar{\mathbf{R}}(n)}{\det \mathbf{R}(n)} \tag{6.5.45}$$

Proof. Problem 6.17.


The previous concepts are illustrated in the following example. 


EXAMPLE 6.5.2. A random sequence $x(n)$ is generated by passing the white Gaussian noise
process $w(n) \sim \mathrm{WN}(0, 1)$ through the filter

$$x(n) = w(n) + \tfrac{1}{2} w(n-1)$$

Determine the second-order FLP, BLP, and symmetric linear signal smoother.
Solution. The complex power spectrum is

$$R(z) = H(z)H(z^{-1}) = (1 + \tfrac{1}{2}z^{-1})(1 + \tfrac{1}{2}z) = \tfrac{1}{2}z + \tfrac{5}{4} + \tfrac{1}{2}z^{-1}$$

Therefore, the autocorrelation sequence is equal to $r(0) = \tfrac{5}{4}$, $r(\pm 1) = \tfrac{1}{2}$, $r(l) = 0$ for $|l| \ge 2$.
Since the power spectrum $R(e^{j\omega}) = \tfrac{5}{4} + \cos\omega > 0$ for all $\omega$, the autocorrelation matrix is
positive definite. The same is true of any principal submatrix. To determine the second-order
linear signal estimators, we start with the matrix

$$\bar{\mathbf{R}} = \begin{bmatrix} \tfrac{5}{4} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & \tfrac{5}{4} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & \tfrac{5}{4} \end{bmatrix}$$

and follow the procedure outlined in Section 6.5.1 or use the formulas in Table 6.3. The results
are

Forward linear prediction ($i = 0$): $\{a_k\} \to \{1, -0.476, 0.190\}$, $P_o^f = 1.0119$
Symmetric linear smoothing ($i = 1$): $\{c_k\} \to \{-0.4, 1, -0.4\}$, $P_o^s = 0.8500$
Backward linear prediction ($i = 2$): $\{b_k\} \to \{0.190, -0.476, 1\}$, $P_o^b = 1.0119$


The inverse of the correlation matrix R is 


$$\bar{\mathbf{R}}^{-1} = \begin{bmatrix} 0.9882 & -0.4706 & 0.1882 \\ -0.4706 & 1.1765 & -0.4706 \\ 0.1882 & -0.4706 & 0.9882 \end{bmatrix}$$


and we see that dividing the first, second, and third columns by 0.9882, 1.1765, and 0.9882 
provides the forward PEF, the symmetric linear smoothing filter, and the backward PEF, respectively.
The inverses of the diagonal elements provide the MMSEs $P_o^f$, $P_o^s$, and $P_o^b$. The reader
can easily see, by computing the zeros of the corresponding system functions, that the FLP is 
minimum-phase, the BLP is maximum-phase, and the symmetric linear smoother is mixed-phase. 
It is interesting to note that the smoother performs better than either of the predictors. 
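
The phase claims can be checked by computing the zeros directly. The short sketch below is only an illustration: the helper name `quadratic_zeros` is hypothetical, and the coefficients are the rounded values quoted in the example, so the moduli are approximate.

```python
import cmath


def quadratic_zeros(c0, c1, c2):
    """Zeros of c0 + c1 z^-1 + c2 z^-2, that is, roots of c0 z^2 + c1 z + c2."""
    disc = cmath.sqrt(c1 * c1 - 4 * c0 * c2)
    return (-c1 + disc) / (2 * c0), (-c1 - disc) / (2 * c0)


# Rounded coefficients from Example 6.5.2
filters = {
    "forward PEF": (1.0, -0.476, 0.190),
    "smoother": (-0.4, 1.0, -0.4),
    "backward PEF": (0.190, -0.476, 1.0),
}

zero_moduli = {
    name: sorted(abs(z) for z in quadratic_zeros(*coeffs))
    for name, coeffs in filters.items()
}
for name, moduli in zero_moduli.items():
    print(name, [round(m, 4) for m in moduli])
```

Both zeros of the forward PEF should lie inside the unit circle, both zeros of the backward PEF outside it, and the smoother should have one zero on each side.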


6.6 OPTIMUM INFINITE IMPULSE RESPONSE FILTERS 


So far we have dealt with optimum FIR filters and predictors for nonstationary and stationary 
processes. In this section, we consider the design of optimum IIR filters for stationary 
stochastic processes. For nonstationary processes, the theory becomes very complicated. 
The Wiener-Hopf equations for optimum IIR filters are the same as for FIR filters; only the
limits in the convolution summation and the range of values for which the normal equations
hold are different. Both are determined by the limits of summation in the filter convolution
equation. We can easily see from (6.4.16) and (6.4.17), or by applying the orthogonality
principle (6.2.41), that the optimum IIR filter

$$\hat{y}(n) = \sum_{k} h_o(k)\, x(n-k) \tag{6.6.1}$$

is specified by the Wiener-Hopf equations

$$\sum_{k} h_o(k)\, r_x(m-k) = r_{yx}(m) \tag{6.6.2}$$

and the MMSE is given by

$$P_o = r_y(0) - \sum_{k} h_o(k)\, r_{yx}^*(k) \tag{6.6.3}$$

where $r_x(l)$ is the autocorrelation of the input stochastic process $x(n)$ and $r_{yx}(l)$ is the
cross-correlation between the desired response process $y(n)$ and $x(n)$. We assume that the
processes $x(n)$ and $y(n)$ are jointly wide-sense stationary with zero mean values.

The range of summation in the above equations includes all the nonzero coefficients
of the impulse response of the filter. The range of $k$ in (6.6.1) determines the number of
unknowns and the number of equations, that is, the range of $m$. For IIR filters, we have an
infinite number of equations and unknowns, and thus only analytical solutions for (6.6.2)
are possible. The key to analytical solutions is that the left-hand side of (6.6.2) can be
expressed as the convolution of $h_o(m)$ with $r_x(m)$, that is,

$$h_o(m) * r_x(m) = r_{yx}(m) \tag{6.6.4}$$

which is a convolutional equation that can be solved by using the z-transform. The
complexity of the solution depends on the range of $m$.

The formula for the MMSE is the same for any filter, either FIR or IIR. Indeed, using
Parseval’s theorem and (6.6.3), we obtain

$$P_o = r_y(0) - \frac{1}{2\pi j}\oint_C H_o(z)\, R_{yx}^*\!\left(\frac{1}{z^*}\right) z^{-1}\, dz \tag{6.6.5}$$

where $H_o(z)$ is the system function of the optimum filter and $R_{yx}(z) = \mathcal{Z}\{r_{yx}(l)\}$. The
power $P_y$ can be computed by

$$P_y = r_y(0) = \frac{1}{2\pi j}\oint_C R_y(z)\, z^{-1}\, dz \tag{6.6.6}$$

where $R_y(z) = \mathcal{Z}\{r_y(l)\}$. Combining (6.6.5) with (6.6.6), we obtain

$$P_o = \frac{1}{2\pi j}\oint_C \left[ R_y(z) - H_o(z)\, R_{yx}^*\!\left(\frac{1}{z^*}\right) \right] z^{-1}\, dz \tag{6.6.7}$$

which expresses the MMSE in terms of z-transforms. To obtain the MMSE in the frequency
domain, we replace $z$ by $e^{j\omega}$. For example, (6.6.5) becomes

$$P_o = r_y(0) - \frac{1}{2\pi}\int_{-\pi}^{\pi} H_o(e^{j\omega})\, R_{yx}^*(e^{j\omega})\, d\omega$$

where $H_o(e^{j\omega})$ is the frequency response of the optimum filter.


6.6.1 Noncausal IIR Filters 
For the noncausal IIR filter

$$\hat{y}(n) = \sum_{k=-\infty}^{\infty} h_{nc}(k)\, x(n-k) \tag{6.6.8}$$

the range of the Wiener-Hopf equations (6.6.2) is $-\infty < m < \infty$, and they can be easily solved
by using the convolution property of the z-transform. This gives

$$H_{nc}(z)\, R_x(z) = R_{yx}(z)$$

or

$$H_{nc}(z) = \frac{R_{yx}(z)}{R_x(z)} \tag{6.6.9}$$

where $H_{nc}(z)$ is the system function of the optimum filter, $R_x(z)$ is the complex PSD of
$x(n)$, and $R_{yx}(z)$ is the complex cross-PSD between $y(n)$ and $x(n)$.


6.6.2 Causal IIR Filters 


For the causal IIR filter 


$$\hat{y}(n) = \sum_{k=0}^{\infty} h_c(k)\, x(n-k) \tag{6.6.10}$$

the Wiener-Hopf equations (6.6.2) hold only for $m$ in the range $0 \le m < \infty$. Since the
sequence $r_{yx}(m)$ can be expressed as the convolution of $h_c(m)$ and $r_x(m)$ only for $m \ge 0$,
we cannot solve (6.6.2) using the z-transform. However, a simple solution is possible
using the spectral factorization theorem.† This approach was introduced for continuous-time
processes in Bode and Shannon (1950) and Zadeh and Ragazzini (1950). It is based
on the following two observations:

1. The solution of the Wiener-Hopf equations is trivial if the input is white.
2. Any regular process can be transformed to an equivalent white process.

White input processes. We first note that if the process $x(n)$ is white noise, the solution
of the Wiener-Hopf equations is trivial. Indeed, if

$$r_x(l) = \sigma_x^2\, \delta(l)$$

then Equation (6.6.4) gives

$$h_c(m) * \delta(m) = \frac{r_{yx}(m)}{\sigma_x^2} \qquad 0 \le m < \infty$$

which implies that

$$h_c(m) = \begin{cases} \dfrac{1}{\sigma_x^2}\, r_{yx}(m) & 0 \le m < \infty \\ 0 & m < 0 \end{cases} \tag{6.6.11}$$

because the filter is causal. The system function of the optimum filter is given by

$$H_c(z) = \frac{1}{\sigma_x^2}\, [R_{yx}(z)]_+ \tag{6.6.12}$$

where

$$[R_{yx}(z)]_+ \triangleq \sum_{l=0}^{\infty} r_{yx}(l)\, z^{-l} \tag{6.6.13}$$

is the one-sided z-transform of the two-sided sequence $r_{yx}(l)$. The MMSE is given by

$$P_c = r_y(0) - \frac{1}{\sigma_x^2} \sum_{k=0}^{\infty} |r_{yx}(k)|^2 \tag{6.6.14}$$

which follows from (6.6.3) and (6.6.11).

Regular input processes. The PSD of a regular process can be factored as

$$R_x(z) = \sigma_x^2\, H_x(z)\, H_x^*\!\left(\frac{1}{z^*}\right) \tag{6.6.15}$$

where $H_x(z)$ is the innovations filter (see Section 4.1). The innovations process

$$w(n) = x(n) - \sum_{k=1}^{\infty} h_x(k)\, w(n-k) \tag{6.6.16}$$

is white and linearly equivalent to the input process $x(n)$. Therefore, linear estimation of
$y(n)$ based on $x(n)$ is equivalent to linear estimation of $y(n)$ based on $w(n)$. The optimum
filter that estimates $y(n)$ from $x(n)$ is obtained by cascading the whitening filter $1/H_x(z)$
with the optimum filter that estimates $y(n)$ from $w(n)$ (see Figure 6.18). Since $w(n)$ is
white, the optimum filter for estimating $y(n)$ from $w(n)$ is

$$H_c'(z) = \frac{1}{\sigma_x^2}\, [R_{yw}(z)]_+ \tag{6.6.17}$$

where $[R_{yw}(z)]_+$ is the one-sided z-transform of $r_{yw}(l)$. To express $H_c'(z)$ in terms of
$R_{yx}(z)$, we need the relationship between $R_{yw}(z)$ and $R_{yx}(z)$. From

$$x(n) = \sum_{k=0}^{\infty} h_x(k)\, w(n-k)$$

if we recall that $r_{yx}(l) = r_{yw}(l) * h_x^*(-l)$, we obtain

$$E\{y(n)\, x^*(n-l)\} = \sum_{k=0}^{\infty} h_x^*(k)\, E\{y(n)\, w^*(n-l-k)\}$$

or

$$r_{yx}(l) = \sum_{k=0}^{\infty} h_x^*(k)\, r_{yw}(l+k) \tag{6.6.18}$$

Taking the z-transform of the above equation leads to

$$R_{yw}(z) = \frac{R_{yx}(z)}{H_x^*(1/z^*)} \tag{6.6.19}$$

which, combined with (6.6.17), gives

$$H_c'(z) = \frac{1}{\sigma_x^2} \left[ \frac{R_{yx}(z)}{H_x^*(1/z^*)} \right]_+ \tag{6.6.20}$$

which is the causal optimum filter for the estimation of $y(n)$ from $w(n)$. The optimum filter
for estimating $y(n)$ from $x(n)$ is

$$H_c(z) = \frac{1}{\sigma_x^2\, H_x(z)} \left[ \frac{R_{yx}(z)}{H_x^*(1/z^*)} \right]_+ \tag{6.6.21}$$

which is causal since it is the cascade connection of two causal filters [see Figure 6.19(a)].

† An analogous matrix-based approach is extensively used in Chapter 7 for the design and implementation of
optimum FIR filters.


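FIGURE 6.18
Optimum causal IIR filter design by the spectral factorization
method. [Block diagram: $x(n)$ enters the whitening filter $1/H_x(z)$, whose output $w(n)$ drives the optimum filter for white input $H_c'(z)$ to produce $\hat{y}(n)$; the cascade forms the optimum filter.]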


The MMSE from (6.6.3) can also be expressed as

$$P_c = r_y(0) - \frac{1}{\sigma_x^2} \sum_{k=0}^{\infty} |r_{yw}(k)|^2 \tag{6.6.22}$$

which shows that the MMSE decreases as we increase the order of the filter. Table 6.5
summarizes the equations required for the design of optimum FIR and IIR filters.


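FIGURE 6.19
Comparison of causal and noncausal IIR optimum filters. [Block diagrams: (a) optimum causal IIR filter, formed by the whitening filter $1/H_x(z)$ followed by the optimum causal filter for white input $\frac{1}{\sigma_x^2}\left[ R_{yx}(z)/H_x^*(1/z^*) \right]_+$; (b) optimum noncausal IIR filter, formed by the whitening filter $1/H_x(z)$ followed by the optimum noncausal filter for white input $\frac{1}{\sigma_x^2}\, R_{yx}(z)/H_x^*(1/z^*)$.]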


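TABLE 6.5
Design of FIR and IIR optimum filters for stationary processes.

Filter type    | Solution | Required quantities
FIR            | $e(n) = y(n) - \mathbf{c}_o^H \mathbf{x}(n)$; $\mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d}$; $P_o = r_y(0) - \mathbf{d}^H \mathbf{c}_o$ | $\mathbf{R} = [r_x(m-k)]$, $\mathbf{d} = [r_{yx}(m)]$, $0 \le k, m \le M-1$, $M$ = finite
Noncausal IIR  | $H_{nc}(z) = \dfrac{R_{yx}(z)}{R_x(z)}$; $P_{nc} = r_y(0) - \sum_{k=-\infty}^{\infty} h_{nc}(k)\, r_{yx}^*(k)$ | $R_x(z) = \mathcal{Z}\{r_x(l)\}$, $R_{yx}(z) = \mathcal{Z}\{r_{yx}(l)\}$
Causal IIR     | $H_c(z) = \dfrac{1}{\sigma_x^2 H_x(z)} \left[ \dfrac{R_{yx}(z)}{H_x^*(1/z^*)} \right]_+$; $P_c = r_y(0) - \sum_{k=0}^{\infty} h_c(k)\, r_{yx}^*(k)$ | $R_x(z) = \sigma_x^2 H_x(z) H_x^*(1/z^*)$, $R_{yx}(z) = \mathcal{Z}\{r_{yx}(l)\}$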


Finally, since the equation for the noncausal IIR filter can be written as

$$H_{nc}(z) = \frac{1}{\sigma_x^2\, H_x(z)}\, \frac{R_{yx}(z)}{H_x^*(1/z^*)} \tag{6.6.23}$$

we see that the only difference from the causal filter is that the noncausal filter includes
both the causal and noncausal parts of $R_{yx}(z)/H_x^*(1/z^*)$ [see Figure 6.19(b)]. By using the
innovations process $w(n)$, the MMSE can be expressed as

$$P_{nc} = r_y(0) - \frac{1}{\sigma_x^2} \sum_{k=-\infty}^{\infty} |r_{yw}(k)|^2 \tag{6.6.24}$$

and is known as the irreducible MMSE because it is the best performance that can be
achieved by a linear filter. Indeed, since $|r_{yw}(k)| \ge 0$, every coefficient we add to the
optimum filter can help to reduce the MMSE.
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6.6.3 Filtering of Additive Noise 


To illustrate the optimum filtering theory developed above, we consider the problem of 
estimating a “useful” or desired signal y(n) that is corrupted by additive noise v(n). The 
goal is to find an optimum filter that extracts the signal y(n) from the noisy observations 


$$x(n) = y(n) + v(n) \tag{6.6.25}$$

given that $y(n)$ and $v(n)$ are uncorrelated processes with known autocorrelation sequences
$r_y(l)$ and $r_v(l)$.

To design the optimum filter, we need the autocorrelation $r_x(l)$ of the input signal $x(n)$
and the cross-correlation $r_{yx}(l)$ between the desired response $y(n)$ and the input signal $x(n)$.
Using (6.6.25), we find

$$r_x(l) = E\{x(n)\, x^*(n-l)\} = r_y(l) + r_v(l) \tag{6.6.26}$$

and

$$r_{yx}(l) = E\{y(n)\, x^*(n-l)\} = r_y(l) \tag{6.6.27}$$

because $y(n)$ and $v(n)$ are uncorrelated.
The design of optimum IIR filters requires the functions $R_x(z)$ and $R_{yx}(z)$. Taking the
z-transform of (6.6.26) and (6.6.27), we obtain

$$R_x(z) = R_y(z) + R_v(z) \tag{6.6.28}$$

and

$$R_{yx}(z) = R_y(z) \tag{6.6.29}$$

The noncausal optimum filter is given by

$$H_{nc}(z) = \frac{R_{yx}(z)}{R_x(z)} = \frac{R_y(z)}{R_y(z) + R_v(z)} \tag{6.6.30}$$

which for $z = e^{j\omega}$ shows that, for those values of $\omega$ for which $|R_y(e^{j\omega})| \gg |R_v(e^{j\omega})|$, that
is, for high SNR, we have $|H_{nc}(e^{j\omega})| \approx 1$. In contrast, if $|R_y(e^{j\omega})| \ll |R_v(e^{j\omega})|$, that is,
for low SNR, we have $|H_{nc}(e^{j\omega})| \approx 0$. Thus, the optimum filter “passes” its input in bands
with high SNR and attenuates it in bands with low SNR, as we would expect intuitively.
Substituting (6.6.30) into (6.6.7), we obtain for real-valued signals

$$P_{nc} = \frac{1}{2\pi j}\oint_C \frac{R_y(z)\, R_v(z)}{R_y(z) + R_v(z)}\, z^{-1}\, dz \tag{6.6.31}$$

which provides an expression for the MMSE that does not require knowledge of the optimum
filter.

We next illustrate the design of optimum filters for the reduction of additive noise with
a detailed numerical example.

EXAMPLE 6.6.1. In this example we illustrate the design of an optimum IIR filter to extract a 
random signal with known autocorrelation sequence 


$$r_y(l) = a^{|l|} \qquad -1 < a < 1 \tag{6.6.32}$$

which is corrupted by additive white noise with autocorrelation

$$r_v(l) = \sigma_v^2\, \delta(l) \tag{6.6.33}$$

The processes $y(n)$ and $v(n)$ are uncorrelated.

Required statistical moments. The input to the filter is the signal $x(n) = y(n) + v(n)$ and
the desired response, the signal $y(n)$. The first step in the design is to determine the required
second-order moments, that is, the autocorrelation of the input process and the cross-correlation
between input and desired response. Substituting into (6.6.26) and (6.6.27), we have

$$r_x(l) = a^{|l|} + \sigma_v^2\, \delta(l) \tag{6.6.34}$$

and

$$r_{yx}(l) = a^{|l|} \tag{6.6.35}$$

To simplify the derivations and deal with “nice, round” numbers, we choose $a = 0.8$ and $\sigma_v^2 = 1$.
Then the complex power spectral densities of $y(n)$, $v(n)$, and $x(n)$ are

$$R_y(z) = \frac{(\tfrac{3}{5})^2}{(1 - \tfrac{4}{5}z^{-1})(1 - \tfrac{4}{5}z)} \qquad \tfrac{4}{5} < |z| < \tfrac{5}{4} \tag{6.6.36}$$

$$R_v(z) = \sigma_v^2 = 1 \tag{6.6.37}$$

and

$$R_x(z) = \frac{8}{5}\, \frac{(1 - \tfrac{1}{2}z^{-1})(1 - \tfrac{1}{2}z)}{(1 - \tfrac{4}{5}z^{-1})(1 - \tfrac{4}{5}z)} \tag{6.6.38}$$

respectively.
Noncausal filter. Using (6.6.9), (6.6.29), (6.6.36), and (6.6.38), we obtain

$$H_{nc}(z) = \frac{R_{yx}(z)}{R_x(z)} = \frac{9}{40}\, \frac{1}{(1 - \tfrac{1}{2}z^{-1})(1 - \tfrac{1}{2}z)} \qquad \tfrac{1}{2} < |z| < 2$$

Evaluating the inverse z-transform, we have

$$h_{nc}(n) = \tfrac{3}{10} (\tfrac{1}{2})^{|n|} \qquad -\infty < n < \infty$$

which clearly corresponds to a noncausal filter. From (6.6.3), the MMSE is

$$P_{nc} = 1 - \frac{3}{10} \sum_{k=-\infty}^{\infty} (\tfrac{1}{2})^{|k|} (\tfrac{4}{5})^{|k|} = \frac{3}{10} \tag{6.6.39}$$

and provides the irreducible MMSE.
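
The value of the irreducible MMSE can be cross-checked numerically. The sketch below is illustrative: it evaluates (6.6.31) on the unit circle with a simple rectangle rule and, separately, sums the time-domain series of (6.6.3) using the impulse response that follows from inverting $H_{nc}(z)$ for $a = 0.8$ and $\sigma_v^2 = 1$. The function names and the truncation limits are arbitrary choices made here.

```python
import math

a, var_v = 0.8, 1.0   # values chosen in Example 6.6.1


def psd_y(w):
    """PSD of y(n) for r_y(l) = a^|l|, evaluated at frequency w."""
    return (1 - a * a) / (1 - 2 * a * math.cos(w) + a * a)


def mmse_noncausal(n_points=4096):
    """Rectangle-rule estimate of (1/2pi) * integral of Ry*Rv/(Ry+Rv) over one period."""
    total = 0.0
    for k in range(n_points):
        w = -math.pi + 2 * math.pi * k / n_points
        ry = psd_y(w)
        total += ry * var_v / (ry + var_v)
    return total / n_points


# Time-domain check: h_nc(k) = (3/10)(1/2)^|k| and r_yx(k) = (4/5)^|k|,
# with the two-sided sum truncated at |k| = 200
p_series = 1.0 - sum(0.3 * 0.5 ** abs(k) * 0.8 ** abs(k) for k in range(-200, 201))
p_freq = mmse_noncausal()
print(p_freq, p_series)
```

Both numbers should be close to 3/10, the frequency-domain route needing no knowledge of the filter itself.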
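Causal filter. To find the optimum causal filter, we need to perform the spectral factorization

$$R_x(z) = \sigma_x^2\, H_x(z)\, H_x(z^{-1})$$

which is provided by (6.6.38) with

$$\sigma_x^2 = \tfrac{8}{5} \tag{6.6.40}$$

and

$$H_x(z) = \frac{1 - \tfrac{1}{2}z^{-1}}{1 - \tfrac{4}{5}z^{-1}} \tag{6.6.41}$$

Thus,

$$R_{yw}(z) = \frac{R_{yx}(z)}{H_x(z^{-1})} = \frac{0.36}{(1 - \tfrac{4}{5}z^{-1})(1 - \tfrac{1}{2}z)} = \frac{0.6}{1 - \tfrac{4}{5}z^{-1}} + \frac{0.3z}{1 - \tfrac{1}{2}z} \tag{6.6.42}$$

where the first term (causal) converges for $|z| > \tfrac{4}{5}$ and the second term (noncausal) converges
for $|z| < 2$. Hence, taking the causal part

$$\left[ \frac{R_{yx}(z)}{H_x(z^{-1})} \right]_+ = \frac{\tfrac{3}{5}}{1 - \tfrac{4}{5}z^{-1}}$$

and substituting into (6.6.21), we obtain the causal optimum filter

$$H_c(z) = \frac{5}{8} \left( \frac{1 - \tfrac{4}{5}z^{-1}}{1 - \tfrac{1}{2}z^{-1}} \right) \left( \frac{\tfrac{3}{5}}{1 - \tfrac{4}{5}z^{-1}} \right) = \frac{3}{8}\, \frac{1}{1 - \tfrac{1}{2}z^{-1}} \qquad |z| > \tfrac{1}{2} \tag{6.6.43}$$

The impulse response is

$$h_c(n) = \tfrac{3}{8} (\tfrac{1}{2})^n u(n)$$

which corresponds to a causal and stable IIR filter. The MMSE is

$$P_c = r_y(0) - \sum_{k=0}^{\infty} h_c(k)\, r_{yx}(k) = 1 - \frac{3}{8} \sum_{k=0}^{\infty} (\tfrac{1}{2})^k (\tfrac{4}{5})^k = \frac{3}{8} \tag{6.6.44}$$

which is, as expected, larger than $P_{nc}$.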
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From (6.6.43), we see that the optimum causal filter is a first-order recursive filter that can 
be implemented by the difference equation 


$$\hat{y}(n) = \tfrac{1}{2}\, \hat{y}(n-1) + \tfrac{3}{8}\, x(n)$$

In general, this is possible only when $H_c(z)$ is a rational function.
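
The recursion implied by $H_c(z) = \tfrac{3}{8}/(1 - \tfrac{1}{2}z^{-1})$ is easy to exercise in code. The following sketch is illustrative only: the function name `causal_wiener` is hypothetical, a zero initial state is assumed, and the MMSE sum is truncated at 200 terms.

```python
def causal_wiener(x):
    """Run y_hat(n) = 0.5*y_hat(n-1) + 0.375*x(n), starting from a zero state."""
    y_hat, prev = [], 0.0
    for sample in x:
        prev = 0.5 * prev + 0.375 * sample
        y_hat.append(prev)
    return y_hat


# Impulse response of the recursion; it should follow (3/8)(1/2)^n
impulse = [1.0] + [0.0] * 9
h = causal_wiener(impulse)

# MMSE from P_c = r_y(0) - sum_k h_c(k) r_yx(k), with r_yx(k) = 0.8^k for k >= 0
mmse = 1.0 - sum(0.375 * 0.5 ** k * 0.8 ** k for k in range(200))
print(h[:4], mmse)
```

A single multiply-accumulate per sample realizes an impulse response of infinite length, which is why the rational form of the system function matters for implementation.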


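Computation of MMSE using the innovation. We next illustrate how to find the MMSE by
using the cross-correlation sequence $r_{yw}(l)$. From (6.6.42), we obtain

$$r_{yw}(l) = \begin{cases} \tfrac{3}{5} (\tfrac{4}{5})^{l} & l \ge 0 \\ \tfrac{3}{5}\, 2^{l} & l < 0 \end{cases} \tag{6.6.45}$$

which, in conjunction with (6.6.22) and (6.6.24), gives

$$P_c = r_y(0) - \frac{1}{\sigma_x^2} \sum_{k=0}^{\infty} r_{yw}^2(k) = 1 - \frac{5}{8} \left(\frac{3}{5}\right)^2 \sum_{k=0}^{\infty} \left(\frac{4}{5}\right)^{2k} = \frac{3}{8}$$

and

$$P_{nc} = r_y(0) - \frac{1}{\sigma_x^2} \left[ \sum_{k=0}^{\infty} r_{yw}^2(k) + \sum_{k=-\infty}^{-1} r_{yw}^2(k) \right] = \frac{3}{10}$$

which agree with (6.6.44) and (6.6.39).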


Noncausal smoothing filter. Suppose now that we want to estimate the value $y(n + D)$ of the
desired response from the data $x(n)$, $-\infty < n < \infty$. Since

$$E\{y(n + D)\, x^*(n - l)\} = r_{yx}(l + D) \tag{6.6.46}$$

and

$$\mathcal{Z}\{r_{yx}(l + D)\} = z^D R_{yx}(z) \tag{6.6.47}$$

the noncausal Wiener smoothing filter is

$$H_{nc}^{D}(z) = \frac{z^D R_{yx}(z)}{R_x(z)} = z^D H_{nc}(z) \tag{6.6.48}$$

$$h_{nc}^{D}(n) = h_{nc}(n + D) \tag{6.6.49}$$

The MMSE is

$$P_{nc}^{D} = r_y(0) - \sum_{k=-\infty}^{\infty} h_{nc}(k + D)\, r_{yx}(k + D) = P_{nc} \tag{6.6.50}$$

which is independent of the time shift $D$.


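Causal prediction filter. We estimate the value y(n + D) (D > 0) of the desired response using
the data x(k), $-\infty < k \le n$. The whitening part of the causal prediction filter does not depend
on y(n) and is still given by (6.6.41). The coloring part depends on y(n + D) and is given by
$R'_{yw}(z) = z^{D}R_{yw}(z)$ or $r'_{yw}(l) = r_{yw}(l+D)$. Taking into consideration that D > 0, we can
show (see Problem 6.31) that the system function and the impulse response of the causal Wiener
predictor are
$$H_c^{[D]}(z) = \frac{5}{8}\left(\frac{1-\frac{4}{5}z^{-1}}{1-\frac{1}{2}z^{-1}}\right)\left(\frac{\frac{3}{5}\left(\frac{4}{5}\right)^{D}}{1-\frac{4}{5}z^{-1}}\right) = \frac{\frac{3}{8}\left(\frac{4}{5}\right)^{D}}{1-\frac{1}{2}z^{-1}} \tag{6.6.51}$$
and
$$h_c^{[D]}(n) = \tfrac{3}{8}\left(\tfrac{4}{5}\right)^{D}\left(\tfrac{1}{2}\right)^{n}u(n) \tag{6.6.52}$$
respectively. This shows that as $D \to \infty$, the impulse response $h_c^{[D]}(n) \to 0$, which is consistent
with our intuition that the prediction is less and less reliable. The MMSE is
$$P_c^{[D]} = 1 - \tfrac{3}{8}\left(\tfrac{4}{5}\right)^{2D}\sum_{k=0}^{\infty}\left(\tfrac{1}{2}\right)^{k}\left(\tfrac{4}{5}\right)^{k} = 1 - \tfrac{5}{8}\left(\tfrac{4}{5}\right)^{2D} \tag{6.6.53}$$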


and $P_c^{[D]} \to r_y(0) = 1$ as $D \to \infty$, which agrees with our earlier observation. For D = 2, the
MMSE given by (6.6.53) is $P_c^{[2]} = 1 - \frac{5}{8}\left(\frac{4}{5}\right)^{4} = 0.744$, which is larger than $P_c = 0.375$.

Causal smoothing filter. To estimate the value y(n + D) for D < 0 from
the data x(k), $-\infty < k \le n$, we need a smoothing Wiener filter. The derivation, which is
straightforward but somewhat involved, is left for Problem 6.32. The system function of the
optimum smoothing filter is


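$$H_c^{[D]}(z) = \frac{3}{8}\left[\frac{z^{D}}{1-\frac{1}{2}z^{-1}} + \frac{2^{D}\sum_{l=0}^{-D-1}2^{l}z^{-l}}{1-\frac{1}{2}z^{-1}} - \frac{4}{5}\,\frac{2^{D}\sum_{l=0}^{-D-1}2^{l}z^{-l-1}}{1-\frac{1}{2}z^{-1}}\right] \tag{6.6.54}$$
where D < 0. To find the impulse response for D = −2, we invert (6.6.54). This gives
$$h_c^{[-2]}(k) = \tfrac{3}{32}\delta(k) + \tfrac{51}{320}\delta(k-1) + \tfrac{39}{128}\left(\tfrac{1}{2}\right)^{k-2}u(k-2) \tag{6.6.55}$$
and if we express $r_{yx}(k-2)$ in a similar form, we can compute the MMSE
$$P_c^{[-2]} = 1 - \tfrac{3}{32}\left(\tfrac{4}{5}\right)^{2} - \tfrac{51}{320}\left(\tfrac{4}{5}\right) - \tfrac{39}{128}\sum_{k=2}^{\infty}\left(\tfrac{1}{2}\right)^{k-2}\left(\tfrac{4}{5}\right)^{k-2} = \tfrac{39}{128} = 0.3047 \tag{6.6.56}$$
which is less than $P_c = 0.375$. This should be expected since the smoothing Wiener filter uses
more information than the Wiener filter (i.e., when D = 0). In fact it can be shown that
$$\lim_{D\to-\infty} P_c^{[D]} = P_{nc} \quad\text{and}\quad \lim_{D\to-\infty} h_c^{[D]}(n) = h_{nc}(n) \tag{6.6.57}$$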


which is illustrated in Figure 6.20 (Problem 6.22). Figure 6.21 shows the impulse responses 
of the various optimum IIR filters designed in this example. Interestingly, all are obtained by 
shifting and truncating the impulse response of the optimum noncausal IIR filter. 


FIGURE 6.20
MMSE as a function of the time shift D.

FIR filter. The Mth-order FIR filter is obtained by solving the linear system
$$\mathbf{R}\mathbf{h} = \mathbf{d}$$
where
$$\mathbf{R} = \mathrm{Toeplitz}(1+\sigma_v^2,\; a,\; \ldots,\; a^{M-1})$$
and
$$\mathbf{d} = [1 \;\; a \;\; \cdots \;\; a^{M-1}]^T$$
The MMSE is
$$P_o = r_y(0) - \sum_{k=0}^{M-1} h_o(k)\,r_{yx}(k)$$
and is shown in Figure 6.22 as a function of the order M together with $P_c$ and $P_{nc}$. We notice that
an optimum FIR filter of order M = 4 provides satisfactory performance. This can be explained
by noting that the impulse response of the causal optimum IIR filter is negligible for n > 4.
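
The convergence of the FIR MMSE toward the causal IIR value can be reproduced with a few lines of linear algebra. This is a sketch under stated assumptions: the values a = 0.8 and $\sigma_v^2 = 1$ are inferred from the correlations and the factorization used in this example, and the helper name `fir_wiener_mmse` is hypothetical.

```python
import numpy as np

def fir_wiener_mmse(M, a=0.8, sigma_v2=1.0):
    """Solve R h = d for the M-tap FIR filter and return (h, MMSE).

    Assumes r_y(l) = a^|l| and x(n) = y(n) + v(n) with white v(n),
    so that r_x(l) = a^|l| + sigma_v2*delta(l) and r_yx(l) = a^|l|.
    """
    lags = np.arange(M)
    r_x = a ** lags
    r_x[0] += sigma_v2
    # Toeplitz matrix built by indexing the first column with |i - j|.
    R = r_x[np.abs(lags[:, None] - lags[None, :])]
    d = a ** lags
    h = np.linalg.solve(R, d)
    return h, 1.0 - d @ h          # P_o = r_y(0) - sum_k h(k) r_yx(k)

for M in range(1, 9):
    _, P = fir_wiener_mmse(M)
    print(M, round(P, 5))
```

For large matrices a Levinson-type solver would exploit the Toeplitz structure; a general solver is used here only to keep the sketch short.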
FIGURE 6.21
Impulse response of optimum filters for pure filtering, prediction, and smoothing. 
6.6.4 Linear Prediction Using the Infinite Past—Whitening


The one-step forward IIR linear predictor is a causal IIR optimum filter with desired response
y(n) ≜ x(n + 1). The prediction error is
$$e^f(n+1) = x(n+1) - \sum_{k=0}^{\infty} h_{lp}(k)\,x(n-k) \tag{6.6.58}$$
where
$$H_{lp}(z) = \sum_{k=0}^{\infty} h_{lp}(k)\,z^{-k} \tag{6.6.59}$$
is the system function of the optimum predictor. Since y(n) = x(n + 1), we have $r_{yx}(l) = r_x(l+1)$
and $R_{yx}(z) = zR_x(z)$. Hence, the optimum predictor is
$$H_{lp}(z) = \frac{1}{\sigma_x^2 H_x(z)}\left[\frac{z\,\sigma_x^2 H_x(z)H_x^*(1/z^*)}{H_x^*(1/z^*)}\right]_+ = \frac{[zH_x(z)]_+}{H_x(z)} = \frac{zH_x(z) - z}{H_x(z)}$$
and the prediction error filter (PEF) is
$$H_{PEF}(z) = \frac{E^f(z)}{X(z)} = 1 - z^{-1}H_{lp}(z) = \frac{1}{H_x(z)} \tag{6.6.60}$$


that is, the prediction error filter of the one-step IIR linear predictor of a regular process is
identical to the whitening filter of the process. Therefore, the prediction error process is white, and the prediction
error filter is minimum-phase. We will see that the efficient solution of optimum filtering 
problems includes as a prerequisite the solution of a linear prediction problem. Furthermore, 
algorithms for linear prediction provide a convenient way to perform spectral factorization 
in practice. 


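The MMSE is
$$P^f = \frac{1}{2\pi j}\oint\left\{R_x(z) - z\left[1 - \frac{1}{H_x(z)}\right]z^{-1}R_x^*\!\left(\frac{1}{z^*}\right)\right\}z^{-1}\,dz = \frac{\sigma_x^2}{2\pi j}\oint H_x^*\!\left(\frac{1}{z^*}\right)z^{-1}\,dz = \sigma_x^2 \tag{6.6.61}$$
because
$$\frac{1}{2\pi j}\oint H_x^*\!\left(\frac{1}{z^*}\right)z^{-1}\,dz = h_x(0) = 1$$
From Section 2.4.4 and (6.6.61) we have
$$P^f = \sigma_x^2 = \exp\left[\frac{1}{2\pi}\int_{-\pi}^{\pi}\ln R_x(e^{j\omega})\,d\omega\right] \tag{6.6.62}$$
which is known as the Kolmogorov-Szegö formula.

We can easily see that the D-step predictor (D > 0) is given by
$$H_{lp}^{[D]}(z) = \frac{[z^{D}H_x(z)]_+}{H_x(z)} = \frac{1}{H_x(z)}\sum_{k=D}^{\infty} h_x(k)\,z^{-k+D} \tag{6.6.63}$$
but is not guaranteed to be minimum-phase for D ≠ 1.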
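
The Kolmogorov-Szegö formula (6.6.62) lends itself to a direct numerical check. The sketch below is illustrative only: it assumes a hypothetical minimum-phase AR(2) model with coefficients $a_1 = 0.9$, $a_2 = -0.2$ (poles at 0.5 and 0.4) and excitation variance 2, none of which come from the text, and it approximates the integral by the mean of $\ln R_x(e^{j\omega})$ over a uniform frequency grid.

```python
import numpy as np

def one_step_mmse_from_psd(a1, a2, sigma_w2, n_freq=4096):
    """Evaluate exp{(1/2pi) * integral of ln R_x(e^{jw}) dw} on a uniform grid.

    R_x(e^{jw}) = sigma_w2 / |A(e^{jw})|^2 with A(z) = 1 - a1 z^-1 - a2 z^-2.
    """
    omega = 2 * np.pi * np.arange(n_freq) / n_freq
    A = 1 - a1 * np.exp(-1j * omega) - a2 * np.exp(-2j * omega)
    psd = sigma_w2 / np.abs(A) ** 2
    # The mean over one period approximates (1/2pi) times the integral.
    return np.exp(np.mean(np.log(psd)))

# Hypothetical AR(2) parameters chosen so that A(z) is minimum-phase.
P_f = one_step_mmse_from_psd(a1=0.9, a2=-0.2, sigma_w2=2.0)
print(P_f)   # expected to be very close to sigma_w2 = 2.0
```

The result depends only on the excitation variance and not on the pole locations, because the log-magnitude of a minimum-phase factor with unit leading coefficient integrates to zero over one period.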
EXAMPLE 6.6.2. Consider a minimum-phase AR(2) process
$$x(n) = a_1x(n-1) + a_2x(n-2) + w(n)$$
where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$. The complex PSD of the process is
$$R_x(z) = \frac{\sigma_w^2}{A(z)A(z^{-1})} \triangleq \sigma_x^2 H_x(z)H_x(z^{-1})$$
where $A(z) \triangleq 1 - a_1z^{-1} - a_2z^{-2}$ and $\sigma_x^2 = \sigma_w^2$. The one-step forward predictor is given by
$$H_{lp}(z) = z - \frac{z}{H_x(z)} = z - zA(z) = a_1 + a_2z^{-1}$$
or
$$\hat{x}(n+1) = a_1x(n) + a_2x(n-1)$$
as should be expected because the present value of the process depends only on the past two
values. Since the excitation w(n) is white and cannot be predicted from the present or previous
values of the signal x(n), it is equal to the prediction error $e^f(n)$. Therefore, $\sigma_{e^f}^2 = \sigma_w^2$, as
expected from (6.6.62). This shows that the MMSE of the one-step linear predictor depends
on the SFM of the process x(n). It is maximum for a white noise process, which is clearly
unpredictable.

Predictable processes. A random process x(n) is said to be (exactly) predictable if
$P^f = E\{|e^f(n)|^2\} = 0$. We next show that a process x(n) is predictable if and only if its
PSD consists of impulses, that is,
$$R_x(e^{j\omega}) = \sum_k A_k\,\delta(\omega - \omega_k) \tag{6.6.64}$$
or in other words, x(n) is a harmonic process. For this reason harmonic processes are also
known as deterministic processes. From (6.6.60) we have
$$P^f = E\{|e^f(n)|^2\} = \int_{-\pi}^{\pi} |H_{PEF}(e^{j\omega})|^2 R_x(e^{j\omega})\,d\omega \tag{6.6.65}$$
where $H_{PEF}(e^{j\omega})$ is the frequency response of the prediction error filter. Since $R_x(e^{j\omega}) \ge 0$,
the integral in (6.6.65) is zero if and only if $|H_{PEF}(e^{j\omega})|^2 R_x(e^{j\omega}) = 0$. This is possible
only if $R_x(e^{j\omega})$ is a linear combination of impulses, as in (6.6.64), and $e^{j\omega_k}$ are the zeros
of $H_{PEF}(z)$ on the unit circle (Papoulis 1985).

From the Wold decomposition theorem (see Section 4.1.3) we know that every random 
process can be decomposed into two components that are mutually orthogonal: (1) a regular 
component with continuous PSD that can be modeled as the response of a minimum-phase 
system to white noise and (2) a predictable process that can be exactly predicted from 
a linear combination of past values. This component has a line PSD and is essentially a 
harmonic process. A complete discussion of this subject can be found in Papoulis (1985, 
1991) and Therrien (1992). 
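
A single real sinusoid gives a concrete instance of an exactly predictable process. The sketch below is an illustration, not taken from the text: the frequency and phase values are arbitrary assumptions. It uses the trigonometric identity $\cos(\omega_0 n + \phi) = 2\cos\omega_0\cos(\omega_0(n-1) + \phi) - \cos(\omega_0(n-2) + \phi)$, which corresponds to a prediction error filter $1 - 2\cos\omega_0\,z^{-1} + z^{-2}$ with zeros at $e^{\pm j\omega_0}$ on the unit circle.

```python
import numpy as np

omega0, phase = 0.3 * np.pi, 0.7       # arbitrary illustrative values
n = np.arange(50)
x = np.cos(omega0 * n + phase)         # harmonic process with a line spectrum

# Two-tap linear predictor: x_hat(n) = 2 cos(omega0) x(n-1) - x(n-2).
x_pred = 2 * np.cos(omega0) * x[1:-1] - x[:-2]
err = x[2:] - x_pred                   # prediction error for n = 2, ..., 49

print(np.max(np.abs(err)))             # zero up to rounding error
```

The same predictor works for any amplitude and phase, which is why a random-phase sinusoid is predictable even though its phase is unknown.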


6.7 INVERSE FILTERING AND DECONVOLUTION 


In many practical applications, a signal of interest passes through a distorting system whose 
output may be corrupted by additive noise. When the distorting system is linear and time- 
invariant, the observed signal is the convolution of the desired input with the impulse 
response of the system. Since in most cases we deal with linear and time-invariant systems, 
the terms filtering and convolution are often used interchangeably. 

Deconvolution is the process of retrieving the unknown input of a known system by 
using its observed output. If the system is also unknown, which is more common in practical 
applications, we have a problem of blind deconvolution. The term blind deconvolution 
was introduced in Stockham et al. (1975) for a method used to restore old records. Other 
applications include estimation of the vocal tract in speech processing, equalization of 
communication channels, deconvolution of seismic data for the elimination of multiple 
reflections, and image restoration. 

The basic problem is illustrated in Figure 6.23. The output of the unknown LTI system
G(z), which is assumed BIBO stable, is given by
$$x(n) = \sum_{k=-\infty}^{\infty} g(k)\,w(n-k) \tag{6.7.1}$$
where $w(n) \sim \mathrm{IID}(0, \sigma_w^2)$ is a white noise sequence. Suppose that we observe the output
x(n) and that we wish to recover the input signal w(n), and possibly the system G(z), using
the output signal and some statistical information about the input.


FIGURE 6.23
Basic blind deconvolution model: the unknown input w(n) drives the unknown system G(z), whose
output x(n) is processed by the deconvolution filter H(z) to produce y(n).


If we know the system G(z), the inverse system H(z) is obtained by noticing that
perfect retrieval of the input is possible if
$$h(n) * g(n) * w(n) = b_0\,w(n - n_0) \tag{6.7.2}$$
where $b_0$ and $n_0$ are constants. From (6.7.2), we have $h(n) * g(n) = b_0\,\delta(n - n_0)$, or
equivalently
$$H(z) = b_0\,\frac{z^{-n_0}}{G(z)} \tag{6.7.3}$$
which provides the system function of the inverse system. The input can be recovered by
convolving the output with the inverse system H(z). Therefore, the terms inverse filtering
and deconvolution are equivalent for LTI systems.

There are three approaches for blind deconvolution:


• Identify the system G(z), design its inverse system H(z), and then compute the input
w(n).

• Identify directly the inverse H(z) = 1/G(z) of the system, and then determine the input
w(n).

• Estimate directly the input w(n) from the output x(n).


Any of the above approaches requires either directly or indirectly the estimation of
both the magnitude response $|G(e^{j\omega})|$ and the phase response $\angle G(e^{j\omega})$ of the unknown
system. In practice, the problem becomes more complicated because the output x(n) is
usually corrupted by additive noise. If this noise is uncorrelated with the input signal and 
the required second-order moments are available, we show how to design an optimum 
inverse filter that provides an optimum estimate of the input in the presence of noise. In 
Section 6.8 we apply these results to the design of optimum equalizers for data transmission 
systems. The main blind identification and deconvolution problem, in which only statistical 
information about the output is known, is discussed in Chapter 12. 

We now discuss the design of optimum inverse filters for linearly distorted signals 
observed in the presence of additive output noise. The typical configuration is shown in 
Figure 6.24. Ideally, we would like the optimum filter to restore the distorted signal x(n) to
its original value y(n). However, the ability of the optimum filter to attain ideal performance
is limited by three factors. First, there is additive noise v(n) at the output of the system.
Second, if the physical system G(z) is causal, its output s(n) is delayed with respect to the
input, and we may need some delay $z^{-D}$ to improve the performance of the system. When
G(z) is a non-minimum-phase system, the inverse system is either noncausal or unstable
and should be approximated by a causal and stable filter. Third, the inverse system may be
IIR and should be approximated by an FIR filter.


FIGURE 6.24
Typical configuration for optimum inverse system modeling.


The optimum inverse filter is the noncausal Wiener filter
$$H_{nc}(z) = \frac{z^{-D}R_{yx}(z)}{R_x(z)} \tag{6.7.4}$$
where the term $z^{-D}$ appears because the desired response is $y_D(n) \triangleq y(n - D)$. Since y(n)
and v(n) are uncorrelated, we have


$$R_{yx}(z) = R_{ys}(z) \tag{6.7.5}$$
$$R_x(z) = G(z)\,G^*\!\left(\frac{1}{z^*}\right)R_y(z) + R_v(z) \tag{6.7.6}$$
The cross-correlation between y(n) and s(n)
$$R_{ys}(z) = G^*\!\left(\frac{1}{z^*}\right)R_y(z) \tag{6.7.7}$$
is obtained by using Equation (6.6.18). Therefore, the optimum inverse filter is
$$H_{nc}(z) = \frac{z^{-D}\,G^*(1/z^*)\,R_y(z)}{G(z)\,G^*(1/z^*)\,R_y(z) + R_v(z)} \tag{6.7.8}$$
which, in the absence of noise, becomes
$$H_{nc}(z) = \frac{z^{-D}}{G(z)} \tag{6.7.9}$$


as expected. The behavior of the optimum inverse system is illustrated in the following 
example. 


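EXAMPLE 6.7.1. Let the system G(z) be an all-zero non-minimum-phase system given by
$$G(z) = \tfrac{1}{5}(-3z + 7 - 2z^{-1}) = -\tfrac{3}{5}\left(1 - \tfrac{1}{3}z^{-1}\right)(z - 2)$$
Then the inverse system is given by
$$H(z) = G^{-1}(z) = \frac{5}{-3z + 7 - 2z^{-1}} = \frac{1}{1 - \frac{1}{3}z^{-1}} - \frac{1}{1 - 2z^{-1}}$$
which is stable if the ROC is $\frac{1}{3} < |z| < 2$. Therefore, the impulse response of the inverse
system is
$$h(n) = \begin{cases}\left(\frac{1}{3}\right)^{n} & n \ge 0 \\ 2^{n} & n < 0\end{cases}$$
which is noncausal and stable.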

Following the discussion given in this section, we want to design an optimum inverse
system given that G(z) is driven by a white noise sequence y(n) and that the additive noise v(n)
is white, that is, $R_y(z) = \sigma_y^2$ and $R_v(z) = \sigma_v^2$. From (6.7.8), the optimum noncausal inverse
filter is given by
$$H_{nc}(z) = \frac{z^{-D}}{G(z) + [1/G(z^{-1})]\,(\sigma_v^2/\sigma_y^2)}$$
which can be computed by assuming suitable values for variances $\sigma_y^2$ and $\sigma_v^2$. Note that if
$\sigma_v^2 \ll \sigma_y^2$, that is, for very large SNR, we obtain (6.7.9).

A more interesting case occurs when the optimum inverse filter is FIR, which can be easily
implemented. To design this FIR filter, we will need the autocorrelation $r_x(l)$ and the cross-correlation
$r_{y_Dx}(l)$, where $y_D(n) = y(n - D)$ is the delayed system input sequence. Since
$$R_x(z) = \sigma_y^2\,G(z)G(z^{-1}) + \sigma_v^2$$
and
$$R_{y_Dx}(z) = \sigma_y^2\,z^{-D}G(z^{-1})$$
we have (see Section 3.4.1)
$$r_x(l) = g(l) * g(-l) * r_y(l) + r_v(l) = \sigma_y^2\,[g(l) * g(-l)] + \sigma_v^2\,\delta(l)$$
and
$$r_{y_Dx}(l) = g(-l) * r_y(l - D) = \sigma_y^2\,g(-l + D)$$
respectively. Now we can determine the optimum FIR filter $\mathbf{h}_D$ of length M by constructing an
M × M Toeplitz matrix R from $r_x(l)$ and an M × 1 vector d from $r_{y_Dx}(l)$ and then solving
$$\mathbf{R}\mathbf{h}_D = \mathbf{d}$$
for various values of D. We can then plot the MMSE as a function of D to determine the best
value of D (and the corresponding FIR filter) which will give the smallest MMSE. For example,
if $\sigma_y^2 = 1$, $\sigma_v^2 = 0.1$, and M = 10, the correlation functions are
$$r_x(l) = \left\{\tfrac{6}{25},\; -\tfrac{7}{5},\; \underset{l=0}{\tfrac{129}{50}},\; -\tfrac{7}{5},\; \tfrac{6}{25}\right\} \quad\text{and}\quad r_{y_Dx}(l) = \left\{-\tfrac{2}{5},\; \underset{l=D}{\tfrac{7}{5}},\; -\tfrac{3}{5}\right\}$$

The resulting MMSE as a function of D is shown in Figure 6.25, which indicates that the best
value of D is approximately M/2. Finally, plots of impulse responses of the inverse system are
shown in Figure 6.26. The first plot shows the noncausal h(n), the second plot shows the causal
FIR system $h_0(n)$ for D = 0, and the third plot shows the causal FIR system $h_D(n)$ for D = 5.
It is clear that the optimum delayed FIR inverse filter for D ≈ M/2 closely matches the impulse
response of the inverse filter h(n).

FIGURE 6.25
The inverse filtering MMSE as a function of the delay D.

FIGURE 6.26
Impulse responses of optimum inverse filters.
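
The delay sweep described in this example can be sketched as follows. The code is an illustration under the example's stated values ($\sigma_y^2 = 1$, $\sigma_v^2 = 0.1$, M = 10) and the system taps g(−1) = −3/5, g(0) = 7/5, g(1) = −2/5 implied by G(z); the function name `fir_inverse_mmse` is hypothetical, and the MMSE is computed as $\sigma_y^2 - \mathbf{d}^T\mathbf{h}_D$ for real-valued signals.

```python
import numpy as np

def fir_inverse_mmse(D, M=10, sigma_y2=1.0, sigma_v2=0.1):
    """Design the length-M FIR inverse filter for delay D; return (h_D, MMSE)."""
    g = {-1: -3 / 5, 0: 7 / 5, 1: -2 / 5}     # taps of G(z), indexed by n

    def r_x(l):
        # sigma_y2 * [g(l) * g(-l)] + sigma_v2 * delta(l)
        acc = sum(g[k] * g.get(k + l, 0.0) for k in g)
        return sigma_y2 * acc + (sigma_v2 if l == 0 else 0.0)

    R = np.array([[r_x(i - j) for j in range(M)] for i in range(M)])
    # r_{y_D x}(l) = sigma_y2 * g(D - l) for l = 0, ..., M-1
    d = np.array([sigma_y2 * g.get(D - l, 0.0) for l in range(M)])
    h = np.linalg.solve(R, d)
    return h, sigma_y2 - d @ h

mmse = [fir_inverse_mmse(D)[1] for D in range(10)]
for D, P in enumerate(mmse):
    print(D, round(P, 4))
print("best delay:", int(np.argmin(mmse)))
```

Sweeping D in this way is cheap because the matrix R does not depend on the delay; only the right-hand side d changes.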


6.8 CHANNEL EQUALIZATION IN DATA TRANSMISSION SYSTEMS 


The performance of data transmission systems through channels that can be approximated 
by linear systems is limited by factors such as finite bandwidth, intersymbol interference, 
and thermal noise (see Section 1.4). Typical examples include telephone lines, microwave 
line-of-sight radio links, satellite channels, and underwater acoustic channels. When the 
channel frequency response deviates from the ideal of flat magnitude and linear phase, both 
(left and right) tails of a transmitted pulse will interfere with neighboring pulses. Hence, 
the value of a sample taken at the center of a pulse will contain components from the tails 
of the other pulses. The distortion caused by the overlapping tails is known as intersymbol 
interference (ISI), and it can lead to erroneous decisions that increase the probability of 
error. For band-limited channels with low background noise (e.g., voice band telephone 
channel), ISI is the main performance limitation for high-speed data transmission. In radio 
and undersea channels, ISI is the result of multipath propagation (Siller 1984). 
Intersymbol interference occurs in all pulse modulation systems, including frequency- 
shift keying (FSK), phase-shift keying (PSK), and quadrature amplitude modulation (QAM). 
However, to simplify the presentation, we consider a baseband pulse amplitude modulation 
(PAM) system. This does not result in any loss of generality because we can obtain an 
equivalent baseband model for any linear modulation scheme (Proakis 1996). We consider
the K-ary ($K = 2^L$) PAM communication system shown in Figure 6.27(a). The binary
input sequence is subdivided into L-bit blocks, or symbols, and each symbol is mapped
to one of the K amplitude levels, as shown in Figure 6.27(b). The interval $T_B$ is called
the symbol or baud interval while the interval $T_b$ is called the bit interval. The quantity
$R_B = 1/T_B$ is known as the baud rate, and the quantity $R_b = LR_B$ is the bit rate.

FIGURE 6.27
(a) Baseband pulse amplitude modulation data transmission system model and (b) input
symbol sequence $a_n$.

The resulting symbol sequence $\{a_n\}$ modulates the transmitted pulse $g_t(t)$. For analysis
purposes, the symbol sequence $\{a_n\}$ can be represented by an equivalent continuous-time
signal using an impulse train, that is,
$$\{a_n\} \;\Leftrightarrow\; \sum_{n=-\infty}^{\infty} a_n\,\delta(t - nT_B) \tag{6.8.1}$$
The modulated pulses are transmitted over the channel represented by the impulse response
$h_c(t)$ and the additive noise $v_c(t)$. The received signal is filtered by the receiving filter $g_r(t)$
to obtain $\tilde{x}(t)$. Using (6.8.1), the signal $\tilde{x}(t)$ at the output of the receiving filter is given by
$$\tilde{x}(t) = \sum_{k=-\infty}^{\infty} a_k\,\{\delta(t - kT_B) * g_t(t) * h_c(t) * g_r(t)\} + v_c(t) * g_r(t) = \sum_{k=-\infty}^{\infty} a_k\,h_r(t - kT_B) + \tilde{v}(t) \tag{6.8.2}$$
where
$$h_r(t) \triangleq g_t(t) * h_c(t) * g_r(t) \tag{6.8.3}$$
is the impulse response of the combined system of transmitting filter, channel, and receiving
filter, and
$$\tilde{v}(t) \triangleq g_r(t) * v_c(t) \tag{6.8.4}$$
is the additive noise at the output of the receiving filter.


6.8.1 Nyquist’s Criterion for Zero ISI 


If we sample the received signal $\tilde{x}(t)$ at the time instant $t_0 + nT_B$, we obtain
$$\tilde{x}(t_0 + nT_B) = \sum_{k=-\infty}^{\infty} a_k\,h_r(t_0 + nT_B - kT_B) + \tilde{v}(t_0 + nT_B)$$
$$= a_n h_r(t_0) + \sum_{\substack{k=-\infty \\ k \ne n}}^{\infty} a_k\,h_r(t_0 + nT_B - kT_B) + \tilde{v}(t_0 + nT_B) \tag{6.8.5}$$
where $t_0$ accounts for the channel delay and the sampler phase. The first term in (6.8.5) is
the desired signal term while the third term is the noise term. The middle term in (6.8.5)
represents the ISI, and it will be zero if and only if
$$h_r(t_0 + nT_B - kT_B) = 0 \qquad n \ne k \tag{6.8.6}$$
As was first shown by Nyquist (Gitlin, Hayes, and Weinstein 1992), a time-domain pulse
$h_r(t)$ will have zero crossings once every $T_B$ s, that is,
$$h_r(nT_B) = \begin{cases} 1 & n = 0 \\ 0 & n \ne 0 \end{cases} \tag{6.8.7}$$
if its Fourier transform satisfies the condition
$$\sum_{l=-\infty}^{\infty} H_r\!\left(F + \frac{l}{T_B}\right) = T_B \tag{6.8.8}$$
This condition is known as the Nyquist criterion for zero ISI and its basic meaning is
illustrated in Figure 6.28.

FIGURE 6.28
Frequency-domain Nyquist criterion for zero ISI.

A pulse shape that satisfies (6.8.8) and that is widely used in practice is of the raised
cosine family
$$h_{rc}(t) = \frac{\sin(\pi t/T_B)}{\pi t/T_B}\;\frac{\cos(\pi\alpha t/T_B)}{1 - 4\alpha^2t^2/T_B^2} \tag{6.8.9}$$
where $0 \le \alpha \le 1$ is known as the rolloff factor. This pulse and its Fourier transform
for α = 0, 0.5, and 1 are shown in Figure 6.29. The choice of α = 0 reduces $h_{rc}(t)$
to the unrealizable sinc pulse and $R_B = 1/T_B$, whereas for α = 1 the symbol rate is
$R_B = 1/(2T_B)$. In practice, we can see the effect of ISI and the noise if we display the
received signal on the vertical axis of an oscilloscope and set the horizontal sweep rate at
$1/T_B$. The resulting display is known as eye pattern because it resembles the human eye.
The closing of the eye increases with the increase in ISI.
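
The zero-ISI property of the raised cosine pulse (6.8.9) can be verified by sampling it at the symbol instants. The following is a minimal sketch: the function name `raised_cosine` is an assumption, and the removable singularity at $t = \pm T_B/(2\alpha)$ is replaced by its limit $(\pi/4)\,\mathrm{sinc}(1/(2\alpha))$, which follows from L'Hôpital's rule.

```python
import numpy as np

def raised_cosine(t, T=1.0, alpha=0.5):
    """Raised cosine pulse h_rc(t) of (6.8.9) with symbol interval T."""
    x = np.asarray(t, dtype=float) / T
    denom = 1.0 - (2.0 * alpha * x) ** 2
    singular = np.isclose(denom, 0.0)
    safe = np.where(singular, 1.0, denom)          # avoid division by zero
    # np.sinc(x) is sin(pi x) / (pi x), matching the first factor of (6.8.9).
    h = np.sinc(x) * np.cos(np.pi * alpha * x) / safe
    if alpha > 0:
        limit = (np.pi / 4) * np.sinc(1.0 / (2.0 * alpha))
        h = np.where(singular, limit, h)
    return h

n = np.arange(-5, 6)
for alpha in (0.0, 0.5, 1.0):
    samples = raised_cosine(n, T=1.0, alpha=alpha)
    print(alpha, np.round(samples, 12))
```

For every rolloff factor the samples at $t = nT_B$ form a unit impulse, so the tails of neighboring pulses vanish at the sampling instants; a larger α only makes the tails decay faster between those instants.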


6.8.2 Equivalent Discrete-Time Channel Model 


Referring to Figure 6.27(a), we note that the input to the data transmission system is a
discrete-time sequence $\{a_n\}$ at the symbol rate $1/T_B$ symbols per second, and the input to
the detector is also a discrete-time sequence $\tilde{x}(nT_B)$ at the symbol rate. Thus the overall
system between the input symbols and the equalizer can be modeled as a discrete-time
channel model for further analysis. From (6.8.2), after sampling at the symbol rate, we
obtain
$$\tilde{x}(nT_B) = \sum_{k=-\infty}^{\infty} a_k\,h_r(nT_B - kT_B) + \tilde{v}(nT_B) \tag{6.8.10}$$
where $h_r(t)$ is given in (6.8.3) and $\tilde{v}(t)$ is given in (6.8.4). The first term in (6.8.10) can be
interpreted as a discrete-time IIR filter with impulse response† $\tilde{h}_r(n) \triangleq h_r(nT_B)$ with input
$a_k$. In a practical data transmission system, it is not unreasonable to assume that $\tilde{h}_r(n) = 0$
for |n| > L, where L is some arbitrary positive integer. Then we obtain
$$\tilde{x}(n) = \sum_{k=-L}^{L} a_k\,\tilde{h}_r(n - k) + \tilde{v}(n) \qquad \tilde{x}(n) \triangleq \tilde{x}(nT_B) \quad \tilde{v}(n) \triangleq \tilde{v}(nT_B) \tag{6.8.11}$$
which is an FIR filter of length 2L + 1, shown in Figure 6.30.

† Here we have abused the notation to avoid a new symbol.

FIGURE 6.29
Pulses with a raised cosine spectrum.

FIGURE 6.30
Equivalent discrete-time model of data transmission system with ISI.


There is one difficulty with this model. If we assume that the additive channel noise
$v_c(t)$ is zero-mean white, then the equivalent noise sequence $\tilde{v}(n)$ is not white. This can be
seen from the definition of $\tilde{v}(t)$ in (6.8.4). Thus the autocorrelation of $\tilde{v}(n)$ is given by
$$r_{\tilde{v}}(l) = \sigma_v^2\,r_{g_r}(l) \tag{6.8.12}$$
where $\sigma_v^2$ is the variance of the samples of $v_c(t)$ and $r_{g_r}(l)$ is the sampled autocorrelation of
$g_r(t)$. This nonwhiteness of $\tilde{v}(t)$ poses a problem in the subsequent design and performance
evaluation of equalizers. Therefore, in practice, it is necessary to whiten this noise by
designing a whitening filter and placing it after the sampler in Figure 6.27(a). The whitening
filter is designed by using spectral factorization of $Z[r_{g_r}(l)]$. Let
$$R_{g_r}(z) = Z[r_{g_r}(l)] = R_{g_r}^{+}(z)\,R_{g_r}^{-}(z) \tag{6.8.13}$$
where $R_{g_r}^{+}(z)$ is the minimum-phase factor and $R_{g_r}^{-}(z)$ is the maximum-phase factor. Choosing
$$W(z) \triangleq \frac{1}{R_{g_r}^{+}(z)} \tag{6.8.14}$$
as a causal, stable, and recursive filter and applying the sampled sequence $\tilde{x}(n)$ to this filter,
we obtain
$$x(n) \triangleq w(n) * \tilde{x}(n) = \sum_{k=0}^{\infty} a_k\,h_r(n - k) + v(n) \tag{6.8.15}$$
where
$$h_r(n) \triangleq \tilde{h}_r(n) * w(n) \tag{6.8.16}$$
and
$$v(n) \triangleq w(n) * \tilde{v}(n) \tag{6.8.17}$$
The spectral density of v(n), from (6.8.12), (6.8.13), and (6.8.14), is given by
$$R_v(z) = R_w(z)\,R_{\tilde{v}}(z) = \frac{1}{R_{g_r}^{+}(z)\,R_{g_r}^{-}(z)}\;\sigma_v^2\,R_{g_r}^{+}(z)\,R_{g_r}^{-}(z) = \sigma_v^2 \tag{6.8.18}$$
which means that v(n) is a white sequence. Once again, assuming that $h_r(n) = 0$, n > L,
where L is an arbitrary positive integer, we obtain an equivalent discrete-time channel
model with white noise
$$x(n) = \sum_{k=0}^{L} a_k\,h_r(n - k) + v(n) \tag{6.8.19}$$


This equivalent model is shown in Figure 6.31. An example to illustrate the use of this 
model in the design and analysis of an equalizer is given in the next section. 


FIGURE 6.31 
Equivalent discrete-time model of data transmission system with ISI and WGN. 


6.8.3 Linear Equalizers 


If we know the characteristics of the channel, that is, the magnitude response $|H_c(F)|$ and
the phase response $\angle H_c(F)$, we can design optimum transmitting and receiving filters that
will maximize the SNR and will result in zero ISI at the sampling instant. However, in
practice we have to deal with channels whose characteristics are either unknown (dial-up
telephone channels) or time-varying (ionospheric radio channels). In this case, we usually
use a receiver that consists of a fixed filter $g_r(t)$ and an adjustable linear equalizer, as shown
in Figure 6.32. The response of the fixed filter either is matched to the transmitted pulse
or is designed as a compromise equalizer for an “average” channel typical of the given
application. In principle, to eliminate the ISI, we should design the equalizer so that the
overall pulse shape satisfies Nyquist’s criterion (6.8.6) or (6.8.8).


FIGURE 6.32
Equalizer-based receiver model: (a) continuous-time model and (b) discrete-time model for
synchronous equalizer.


The most widely used equalizers are implemented using digital FIR filters. To this
end, as shown in Figure 6.32(a), we sample the received signal $\tilde{x}(t)$ periodically at times
$t = t_0 + nT$, where $t_0$ is the sampling phase and T is the sampling period. The sampling
period should be less than or equal to the symbol interval $T_B$ because the output of the equalizer
should be sampled once every symbol interval (the case $T > T_B$ creates aliasing). For
digital implementation T should be chosen as a rational fraction of the symbol interval, that
is, $T = L_1T_B/L_2$, with $L_1 \le L_2$ (typical choices are $T = T_B$, $T = T_B/2$, or $T = 2T_B/3$).
If the sampling interval $T = T_B$, we have a synchronous or symbol equalizer (SE)† and if
$T < T_B$ a fractionally spaced equalizer (FSE).‡ The output of the equalizer is quantized to
obtain the decision $\hat{a}_n$.

The goal of the equalizer is to determine the coefficients $\{c_k\}_{-M}^{M}$ so as to minimize
the ISI according to some criterion of performance. The most meaningful criterion for
data transmission is the average probability of error. However, this criterion is a nonlinear
function of the equalizer coefficients, and its minimization is extremely difficult.

We next discuss two criteria that are used in practical applications. For this discussion
we assume a synchronous equalizer, that is, $T = T_B$. The FSE is discussed in Chapter 12.
For the synchronous equalizer, the equivalent discrete-time model given in Figure 6.31 is
applicable in which the input is x(n), given by
$$x(n) = \sum_{l=0}^{L} a_l\,h_r(n - l) + v(n) \tag{6.8.20}$$
The output of the equalizer is given by
$$\hat{y}(n) = \sum_{k=-M}^{M} c^*(k)\,x(n - k) \triangleq \mathbf{c}^H\mathbf{x}(n) \tag{6.8.21}$$
where
$$\mathbf{c} = [c(-M) \;\cdots\; c(0) \;\cdots\; c(M)]^T \tag{6.8.22}$$
$$\mathbf{x}(n) = [x(n+M) \;\cdots\; x(n) \;\cdots\; x(n-M)]^T \tag{6.8.23}$$
This equalizer model is shown in Figure 6.32(b).


† Also known as a baud-spaced equalizer (BSE).

‡ The most significant difference between SE and FSE is that by properly choosing T we can completely avoid
aliasing at the input of the FSE. Thus, the FSE can provide better compensation for timing phase and asymmetries
in the channel response without noise enhancement (Qureshi 1985).

6.8.4 Zero-Forcing Equalizers


Zero-forcing (zf) equalization (Lucky, Saltz, and Weldon 1968) requires that the response
of the equalizer to the combined pulse $h_r(t)$ satisfy the Nyquist criterion (6.8.7). For the
FIR equalizer in (6.8.21), in the absence of noise we have
$$\sum_{k=-M}^{M} c_{zf}(k)\,h_r(n - k) = \begin{cases} 1 & n = 0 \\ 0 & n = \pm 1, \pm 2, \ldots, \pm M \end{cases} \tag{6.8.24}$$

which is a linear system of equations whose solution provides the required coefficients. The
zero-forcing equalizer does not completely eliminate the ISI because it has finite duration.
If M = ∞, Equation (6.8.24) becomes a convolution equation that can be solved by using
the z-transform. The solution is
$$C_{zf}(z) = \frac{1}{H_r(z)} \tag{6.8.25}$$
where $H_r(z)$ is the z-transform of $h_r(n)$. Thus, the zero-forcing equalizer is an inverse filter
that inverts the frequency-folded (aliased) response of the overall channel. When M is finite,
then it is generally impossible to eliminate the ISI at the output of the equalizer because
there are only 2M + 1 adjustable parameters to force zero ISI outside of [−M, M]. Then
the equalizer design problem reverts to minimizing the peak distortion


$$D \triangleq \sum_{n \ne 0}\left|\sum_{k=-M}^{M} c_{zf}(k)\,h_r(n - k)\right| \tag{6.8.26}$$
This distortion function can be shown to be a convex function (Lucky 1965), and its
minimization, in general, is difficult to obtain except when the input ISI is less than 100 percent
(i.e., the eye pattern is open). This minimization and the determination of $\{c_k\}$ can be
obtained by using the steepest descent algorithm, which is discussed in Chapter 10.

Zero-forcing equalizers have two drawbacks: (1) They ignore the presence of noise
and therefore amplify the noise appearing near the spectral nulls of $H_r(e^{j\omega})$, and (2) they
minimize the peak distortion or worst-case ISI only when the eye is open. For these reasons
they are not currently used for bad channels or high-speed modems (Qureshi 1985). The
above two drawbacks are eliminated if the equalizers are designed using the MSE criterion.
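
The finite-length zero-forcing conditions (6.8.24) amount to a (2M + 1) × (2M + 1) linear system. The sketch below is illustrative only: the three-tap channel is a hypothetical example, not one from the text, and `zero_forcing_fir` is an assumed helper name. It also evaluates the residual ISI that remains outside the window [−M, M].

```python
import numpy as np

def zero_forcing_fir(h, h_origin, M):
    """Solve sum_k c(k) h_r(n-k) = delta(n), n = -M..M, for c(-M..M).

    h        : channel taps as a 1-D array
    h_origin : index in h of the n = 0 sample
    """
    def h_r(n):
        i = n + h_origin
        return h[i] if 0 <= i < len(h) else 0.0

    idx = range(-M, M + 1)
    A = np.array([[h_r(n - k) for k in idx] for n in idx])
    delta = np.array([1.0 if n == 0 else 0.0 for n in idx])
    return np.linalg.solve(A, delta)

h = np.array([0.1, 1.0, 0.2])      # hypothetical channel taps at n = -1, 0, 1
M = 3
c = zero_forcing_fir(h, h_origin=1, M=M)

# Combined response of channel and equalizer, spanning n = -(M+1), ..., M+1.
combined = np.convolve(c, h)
# Residual ISI: sum of magnitudes of every sample except the one at n = 0.
peak_distortion = np.sum(np.abs(combined)) - abs(combined[M + 1])
print(np.round(combined, 6), peak_distortion)
```

The combined response is forced to a unit impulse only for |n| ≤ M; the two samples at n = ±(M + 1) are left unconstrained, which is the residual ISI measured by the peak distortion.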


6.8.5 Minimum MSE Equalizers 


It has been shown (Saltzberg 1968) that the error rate $\Pr\{\hat{a}_n \ne a_n\}$ decreases monotonically
with the MSE defined by
$$\mathrm{MSE} = E\{|e(n)|^2\} \tag{6.8.27}$$
where
$$e(n) = y(n) - \hat{y}(n) = a_n - \hat{y}(n) \tag{6.8.28}$$
is the difference between the desired response $y(n) \triangleq a_n$ and the actual response $\hat{y}(n)$ given
in (6.8.21). Therefore, if we minimize the MSE in (6.8.27), we take into consideration both
the ISI and the noise at the output of the equalizer. For M = ∞, following the arguments
similar to those leading to (6.8.25), the minimum MSE equalizer is specified by
$$C_{MSE}(z) = \frac{H_r^*(1/z^*)}{H_r(z)\,H_r^*(1/z^*) + \sigma_v^2} \tag{6.8.29}$$
where $\sigma_v^2$ is the variance of the sampled channel noise $v_c(kT_B)$. Clearly, (6.8.29) reduces to
the zero-forcing equalizer if $\sigma_v^2 = 0$. Also (6.8.29) is the classical Wiener filter. For finite
M, the minimum MSE equalizer is specified by
$$\mathbf{R}\mathbf{c}_o = \mathbf{d} \tag{6.8.30}$$
$$P_o = P_a - \mathbf{c}_o^H\mathbf{d} \tag{6.8.31}$$
where $\mathbf{R} = E\{\mathbf{x}(n)\mathbf{x}^H(n)\}$ and $\mathbf{d} = E\{a_n^*\mathbf{x}(n)\}$. The data sequence $y(n) = a_n$ is assumed
to be white with zero mean and power $P_a = E\{|a_n|^2\}$, and uncorrelated with the additive
channel noise. Under these assumptions, the elements of the correlation matrix R and the
cross-correlation vector d are given by
$$r_{ij} \triangleq E\{x(n-i)\,x^*(n-j)\} = P_a\sum_m h_r(m-i)\,h_r^*(m-j) + \sigma_v^2\,\delta_{ij} \qquad -M \le i, j \le M \tag{6.8.32}$$
and
$$d_i \triangleq E\{x(n-i)\,y^*(n)\} = P_a\,h_r(-i) \qquad -M \le i \le M \tag{6.8.33}$$
that is, in terms of the overall (equivalent) channel response $h_r(n)$ and the noise power $\sigma_v^2$.
We hasten to stress that matrix R is Toeplitz if $T = T_B$; otherwise, for $T \ne T_B$, matrix R
is Hermitian but not Toeplitz.

Since MSE equalizers, in contrast to zero-forcing equalizers, take into account both the 
statistical properties of the noise and the ISI, they are more robust to both noise and large 
amounts of ISI. 


EXAMPLE 6.8.1. Consider the model of the data communication system shown in Figure 6.33.
The input symbol sequence {a(n)} is a Bernoulli sequence {±1}, with Pr{1} = Pr{−1} = 0.5.
The channel (including the receiving and whitening filter) is modeled as
$$h(n) = \begin{cases} 0.5\left[1 + \cos\dfrac{2\pi(n-2)}{W}\right] & n = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases} \tag{6.8.34}$$
where W controls the amount of amplitude distortion introduced by the channel. The channel
impulse response values are (the arrow denotes the sample at n = 0)
$$h(n) = \left\{\underset{\uparrow}{0},\; 0.5\left(1 + \cos\frac{2\pi}{W}\right),\; 1,\; 0.5\left(1 + \cos\frac{2\pi}{W}\right),\; 0\right\} \tag{6.8.35}$$
which is a symmetric channel, and its frequency response is
$$H(e^{j\omega}) = e^{-j2\omega}\left[1 + \left(1 + \cos\frac{2\pi}{W}\right)\cos\omega\right]$$
The channel noise v(n) is modeled as white Gaussian noise (WGN) with zero mean and variance
$\sigma_v^2$. The equalizer is an 11-tap FIR filter whose optimum tap weights {c(n)} are obtained using
either optimum filter theory (nonadaptive approach) or adaptive algorithms that will be described
in Chapter 10. The input to the equalizer is
$$x(n) = s(n) + v(n) = h(n) * a(n) + v(n) \tag{6.8.36}$$
where s(n) represents the distorted pulse sequence. The output of the equalizer is $\hat{y}(n)$, which is
an estimate of a(n). In practical modem implementations, the equalizer is initially designed using
a training sequence that is known to the receiver. It is shown in Figure 6.33 as the sequence y(n).
It is reasonable to introduce a delay D in the training sequence to account for delays introduced
in the channel and in the equalizer; that is, y(n) = a(n − D) during the training phase. The error
sequence e(n) is further used to design the equalizer c(n). The aim of this example is to study
the effect of the delay D and to determine its optimum value for proper operation.


FIGURE 6.33 
Data communication model 
used in Example 6.8.1. 


e(n) 


Channel Equalizer 
h(n) c(n) 


v(n) 
To obtain an optimum equalizer $c(n)$, we will need the autocorrelation matrix $\mathbf{R}_x$ of the
input sequence $x(n)$ and the cross-correlation vector $\mathbf{d}$ between $x(n)$ and $y(n)$. Consider the
autocorrelation $r_x(l)$ of $x(n)$. From (6.8.36), assuming real-valued quantities, we obtain


$$r_x(l) = E\{[s(n) + v(n)][s(n-l) + v(n-l)]\} = r_s(l) + \sigma_v^2\,\delta(l) \tag{6.8.37}$$


where we have assumed that $s(n)$ and $v(n)$ are uncorrelated. Since $s(n)$ is a convolution between
$\{a(n)\}$ and $h(n)$, the autocorrelation $r_s(l)$ is given by


$$r_s(l) = r_a(l) * r_h(l) = r_h(l) \tag{6.8.38}$$


where $r_a(l) = \delta(l)$ since $\{a(n)\}$ is a Bernoulli sequence, and $r_h(l)$ is the autocorrelation of the
channel response $h(n)$ and is given by


$$r_h(l) = h(l) * h(-l)$$


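Using the symmetric channel response values in (6.8.35), we find that the autocorrelation $r_x(l)$
in (6.8.37) is given by


$$\begin{aligned}
r_x(0) &= h^2(1) + h^2(2) + h^2(3) + \sigma_v^2 = 1 + 0.5\left(1 + \cos\frac{2\pi}{W}\right)^2 + \sigma_v^2 \\
r_x(\pm 1) &= h(1)h(2) + h(2)h(3) = 1 + \cos\frac{2\pi}{W} \\
r_x(\pm 2) &= h(1)h(3) = 0.25\left(1 + \cos\frac{2\pi}{W}\right)^2 \\
r_x(l) &= 0 \qquad |l| \ge 3
\end{aligned} \tag{6.8.39}$$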


Since the equalizer is an 11-tap FIR filter, the autocorrelation matrix $\mathbf{R}_x$ is an $11 \times 11$ matrix.
However, because only a few values of $r_x(l)$ in (6.8.39) are nonzero, it is also a pentadiagonal matrix,
with the main diagonal containing $r_x(0)$ and two nonzero diagonals above and below it. The
cross-correlation between $x(n)$ and $y(n) = a(n - D)$ is given by

$$\begin{aligned}
d(l) &= E\{a(n-D)x(n-l)\} = E\{a(n-D)[s(n-l) + v(n-l)]\} \\
&= E\{a(n-D)s(n-l)\} + E\{a(n-D)v(n-l)\} \\
&= E\{a(n-D)[h(n-l) * a(n-l)]\} \\
&= h(D-l) * r_a(D-l) = h(D-l)
\end{aligned} \tag{6.8.40}$$


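where we have used (6.8.36). The last step follows from the fact that $r_a(l) = \delta(l)$. Using the
channel impulse response values in (6.8.35), we obtain


$$\begin{array}{lll}
D = 0: & d(l) = h(-l) = 0 & l \ge 0 \\
D = 1: & d(l) = h(1-l) \;\Rightarrow\; d(0) = h(1) & d(l) = 0,\; l > 0 \\
D = 2: & d(l) = h(2-l) \;\Rightarrow\; d(0) = h(2),\; d(1) = h(1) & d(l) = 0,\; l > 1 \\
\;\;\vdots & & \\
D = 7: & d(l) = h(7-l) \;\Rightarrow\; d(4) = h(3),\; d(5) = h(2),\; d(6) = h(1) & d(l) = 0 \text{ elsewhere}
\end{array} \tag{6.8.41}$$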


Remarks. There are some interesting observations that we can make from (6.8.41), in which the
delay $D$ turns the estimation problem into a filtering, prediction, or smoothing problem.


1. When $D = 0$, we have a filtering case. The cross-correlation vector $\mathbf{d} = \mathbf{0}$, hence the
equalizer taps are all zeros. This means that if we do not provide any delay in the system, the
cross-correlation is zero and equalization is not possible because $\mathbf{c}_o = \mathbf{0}$.

2. When $D = 1$, we have a one-step prediction case.

3. When $D \ge 2$, we have a smoothing filter, which provides better performance. When $D = 7$,
we note that the vector $\mathbf{d}$ is symmetric [with respect to the middle sample $d(5)$] and hence
we should expect the best performance because the channel is also symmetric. We can also
show that $D = 7$ is the optimum delay for this example (see Problem 6.40). However, this
should not be a surprise since $h(n)$ is symmetric about $n = 2$, and if we make the equalizer
symmetric about $n = 5$, then the channel input $a(n)$ is delayed by $D = 5 + 2 = 7$.
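
The design steps of this example can be sketched numerically. The following Python sketch uses only the standard library: it builds $\mathbf{R}_x$ from (6.8.39) and $\mathbf{d}$ from (6.8.40), solves the normal equations $\mathbf{R}_x\mathbf{c}_o = \mathbf{d}$, and evaluates the MMSE as $P_o = r_a(0) - \mathbf{d}^T\mathbf{c}_o$ with $r_a(0) = 1$. The function names `channel`, `solve`, and `equalizer` are illustrative choices that do not appear in the text, and the values $W = 2.9$ and $\sigma_v^2 = 0.001$ are the ones quoted for Figure 6.34.

```python
import math

def channel(W):
    """Channel taps h(0), ..., h(4) from (6.8.34); only n = 1, 2, 3 are nonzero."""
    return [0.5 * (1 + math.cos(2 * math.pi * (n - 2) / W)) if 1 <= n <= 3 else 0.0
            for n in range(5)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def equalizer(W, var_v, D, M=11):
    """Return (c_o, P_o) for an M-tap MSE equalizer with training delay D."""
    h = channel(W)
    L = len(h)
    # r_x(l) = sum_n h(n) h(n + l) + var_v * delta(l), as in (6.8.37)-(6.8.39)
    r = [sum(h[n] * h[n + l] for n in range(L - l)) for l in range(L)]
    r[0] += var_v
    R = [[r[abs(i - j)] if abs(i - j) < L else 0.0 for j in range(M)] for i in range(M)]
    # d(l) = h(D - l), as in (6.8.40)
    d = [h[D - l] if 0 <= D - l < L else 0.0 for l in range(M)]
    c = solve(R, d)
    # MMSE P_o = r_a(0) - d^T c_o, with r_a(0) = 1 for the +/-1 symbols
    mmse = 1.0 - sum(di * ci for di, ci in zip(d, c))
    return c, mmse

if __name__ == "__main__":
    for D in range(12):
        _, P = equalizer(W=2.9, var_v=0.001, D=D)
        print(f"D = {D:2d}  MMSE = {P:.4f}")
```

Scanning $D$ in this way gives a quick check of the remarks: $D = 0$ returns all-zero taps with an MMSE of 1, and the symmetric choice $D = 7$ yields a tap vector that is symmetric about its middle sample.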


Figure 6.34 shows the channel impulse response $h(n)$ and the equalizer $c(n)$ for $D = 7$, $\sigma_v^2 =
0.001$, and $W = 2.9$ and $W = 3.1$.


FIGURE 6.34
Channel impulse response $h(n)$ and the equalizer $c(n)$ for $D = 7$,
$\sigma_v^2 = 0.001$, and $W = 2.9$ and $W = 3.1$.


6.9 MATCHED FILTERS AND EIGENFILTERS 


In this section we discuss the design of optimum filters that maximize the output signal-to- 
noise power ratio. Such filters are used to detect signals in additive noise in many appli- 
cations, including digital communications and radar. First we discuss the case of a known 
deterministic signal in noise, and then we extend the results to the case of a random signal 
in noise. 

Suppose that the observations obtained by sampling the output of a single sensor at $M$
instances, or $M$ sensors at the same instant, are arranged in a vector $\mathbf{x}(n)$. Furthermore, we
assume that the available signal $\mathbf{x}(n)$ consists of a desired signal $\mathbf{s}(n)$ plus an additive noise
plus interference signal $\mathbf{v}(n)$, that is,


$$\mathbf{x}(n) = \mathbf{s}(n) + \mathbf{v}(n) \tag{6.9.1}$$


where $\mathbf{s}(n)$ can be one of two things. It can be a deterministic signal of the form $\mathbf{s}(n) = \alpha\mathbf{s}_0$,
where $\mathbf{s}_0$ is the completely known shape of $\mathbf{s}(n)$ and $\alpha$ is a complex random variable
with power $P_\alpha = E\{|\alpha|^2\}$. The argument $\angle\alpha$ provides the unknown initial phase, and the
modulus $|\alpha|$ the amplitude of the signal. It can also be a random signal with
known correlation matrix $\mathbf{R}_s(n)$. The signals $\mathbf{s}(n)$ and $\mathbf{v}(n)$ are assumed to be uncorrelated
with zero means.

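The output of a linear processor (combiner or FIR filter) with coefficients $\{c_k^*\}_1^M$ is


$$y(n) = \mathbf{c}^H\mathbf{x}(n) = \mathbf{c}^H\mathbf{s}(n) + \mathbf{c}^H\mathbf{v}(n) \tag{6.9.2}$$

and its power

$$P_y(n) = E\{|y(n)|^2\} = E\{\mathbf{c}^H\mathbf{x}(n)\mathbf{x}^H(n)\mathbf{c}\} = \mathbf{c}^H\mathbf{R}_x(n)\mathbf{c} \tag{6.9.3}$$

is a quadratic function of the filter coefficients.
The output noise power is


$$P_v(n) = E\{|\mathbf{c}^H\mathbf{v}(n)|^2\} = E\{\mathbf{c}^H\mathbf{v}(n)\mathbf{v}^H(n)\mathbf{c}\} = \mathbf{c}^H\mathbf{R}_v(n)\mathbf{c} \tag{6.9.4}$$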




where $\mathbf{R}_v(n)$ is the noise correlation matrix. The determination of the output SNR, and
hence the subsequent optimization, depends on the nature of the signal $\mathbf{s}(n)$.


6.9.1 Deterministic Signal in Noise 


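In the deterministic signal case, the power of the signal is


$$P_s(n) = E\{|\alpha\mathbf{c}^H\mathbf{s}_0|^2\} = P_\alpha|\mathbf{c}^H\mathbf{s}_0|^2 \tag{6.9.5}$$

and therefore the output SNR can be written as

$$\text{SNR}(\mathbf{c}) = P_\alpha\frac{|\mathbf{c}^H\mathbf{s}_0|^2}{\mathbf{c}^H\mathbf{R}_v(n)\mathbf{c}} \tag{6.9.6}$$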


White noise case. If the correlation matrix of the additive noise is given by $\mathbf{R}_v(n) =
P_v\mathbf{I}$, the SNR becomes


$$\text{SNR}(\mathbf{c}) = \frac{P_\alpha}{P_v}\frac{|\mathbf{c}^H\mathbf{s}_0|^2}{\mathbf{c}^H\mathbf{c}} \tag{6.9.7}$$

which simplifies the maximization process. Indeed, from the Cauchy-Schwarz inequality

$$|\mathbf{c}^H\mathbf{s}_0| \le (\mathbf{c}^H\mathbf{c})^{1/2}(\mathbf{s}_0^H\mathbf{s}_0)^{1/2} \tag{6.9.8}$$

we conclude that the SNR in (6.9.7) attains its maximum value

$$\text{SNR}_{\max} = \frac{P_\alpha}{P_v}\mathbf{s}_0^H\mathbf{s}_0 \tag{6.9.9}$$

if the optimum filter $\mathbf{c}_o$ is chosen as

$$\mathbf{c}_o = \kappa\mathbf{s}_0 \tag{6.9.10}$$


that is, when the filter is a scaled replica of the known signal shape. This property resulted in
the term matched filter, which is widely used in communications and radar applications.† We
note that if a vector $\mathbf{c}_o$ maximizes the SNR (6.9.7), then any constant $\kappa$ times $\mathbf{c}_o$ maximizes
the SNR as well. Therefore, we can choose this constant in any way we want. In this section,
we choose $\kappa$ so that $\mathbf{c}_o^H\mathbf{s}_0 = 1$.
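
As a small numerical illustration of (6.9.7) through (6.9.10), the sketch below evaluates the SNR for a real-valued signal shape and confirms that a scaled replica of $\mathbf{s}_0$ attains $\text{SNR}_{\max}$ and that the scaling constant does not matter. The signal shape `s0`, the unit powers $P_\alpha = P_v = 1$, and the function name `snr` are assumptions made only for this example.

```python
def snr(c, s0, P_alpha=1.0, P_v=1.0):
    """Output SNR (6.9.7) for real-valued c and s0 in white noise."""
    num = sum(ci * si for ci, si in zip(c, s0)) ** 2
    den = sum(ci * ci for ci in c)
    return (P_alpha / P_v) * num / den

s0 = [1.0, 0.6, 0.36, 0.216]        # hypothetical known signal shape
energy = sum(s * s for s in s0)     # s0^H s0
kappa = 1.0 / energy                # normalization that gives c_o^H s0 = 1
c_o = [kappa * s for s in s0]       # matched filter (6.9.10)
snr_max = snr(c_o, s0)              # equals s0^H s0 when P_alpha = P_v = 1

if __name__ == "__main__":
    print("SNR of matched filter:", snr_max)
    print("SNR of a single-tap filter:", snr([1.0, 0.0, 0.0, 0.0], s0))
```

Any other filter, such as the single-tap filter in the last line, gives a smaller SNR, in agreement with the Cauchy-Schwarz bound (6.9.8).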


Colored noise case. Using the Cholesky decomposition $\mathbf{R}_v = \mathbf{L}_v\mathbf{L}_v^H$ of the noise
correlation matrix, we can write the SNR in (6.9.6) as


$$\text{SNR}(\mathbf{c}) = P_\alpha\frac{|(\mathbf{L}_v^H\mathbf{c})^H(\mathbf{L}_v^{-1}\mathbf{s}_0)|^2}{(\mathbf{L}_v^H\mathbf{c})^H(\mathbf{L}_v^H\mathbf{c})} \tag{6.9.11}$$

which, according to the Cauchy-Schwarz inequality, attains its maximum

$$\text{SNR}_{\max} = P_\alpha\|\mathbf{L}_v^{-1}\mathbf{s}_0\|^2 = P_\alpha\mathbf{s}_0^H\mathbf{R}_v^{-1}\mathbf{s}_0 \tag{6.9.12}$$

when the optimum filter satisfies $\mathbf{L}_v^H\mathbf{c}_o = \kappa\mathbf{L}_v^{-1}\mathbf{s}_0$, or equivalently

$$\mathbf{c}_o = \kappa\mathbf{R}_v^{-1}\mathbf{s}_0 \tag{6.9.13}$$


which provides the optimum matched filter for colored additive noise. Again, the optimum filter
can be scaled in any desirable way. We choose $\mathbf{c}_o^H\mathbf{s}_0 = 1$, which implies $\kappa = (\mathbf{s}_0^H\mathbf{R}_v^{-1}\mathbf{s}_0)^{-1}$.

If we pass the observed signal through the preprocessor $\mathbf{L}_v^{-1}$, we obtain a signal $\mathbf{L}_v^{-1}\mathbf{s}$
in additive white noise $\tilde{\mathbf{v}} = \mathbf{L}_v^{-1}\mathbf{v}$ because $E\{\tilde{\mathbf{v}}\tilde{\mathbf{v}}^H\} = \mathbf{L}_v^{-1}E\{\mathbf{v}\mathbf{v}^H\}\mathbf{L}_v^{-H} = \mathbf{I}$. Therefore, the
optimum matched filter in additive colored noise is the cascade of a whitening filter followed
by a matched filter for white noise (compare with a similar decomposition for the optimum
Wiener filter in Figure 6.19). The application of the optimum matched filter is discussed in


† We note that the matched filter $\mathbf{c}_o$ in (6.9.10) is not a complex conjugate reversed version of the signal $\mathbf{s}$. This
happens when we define the matched filter as a convolution that involves a reversal of the impulse response
(Therrien 1992).

EXAMPLE 6.9.1. Consider a finite-duration deterministic signal $s(n) = a^n$, $0 \le n \le M - 1$,
corrupted by additive noise $v(n)$ with autocorrelation sequence $r_v(l) = \sigma_v^2\rho^{|l|}/(1 - \rho^2)$. We
determine and plot the impulse response of an $M$th-order matched filter for $a = 0.6$, $M =
8$, $\sigma_v^2 = 0.25$, and (a) $\rho = 0.1$ and (b) $\rho = -0.8$. We first note that the signal vector is
$\mathbf{s} = [1\ a\ a^2\ \cdots\ a^7]^T$ and that the noise correlation matrix $\mathbf{R}_v$ is Toeplitz with first row
$[r_v(0)\ r_v(1)\ \cdots\ r_v(7)]$. The optimum matched filters are determined by $\mathbf{c}_o = \mathbf{R}_v^{-1}\mathbf{s}_0$ and are
shown in Figure 6.35. We notice that for $\rho = 0.1$ the matched filter looks like the signal because
the correlation between the samples of the interference is very small; that is, the additive noise
is close to white. For $\rho = -0.8$ the correlation increases, and the shape of the optimum filter
differs more from the shape of the signal. However, as a result of the increased noise correlation,
the optimum SNR increases.
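
The computation in this example can be sketched in a few lines of Python using only the standard library. The sketch forms the Toeplitz matrix $\mathbf{R}_v$, solves $\mathbf{R}_v\mathbf{c} = \mathbf{s}_0$, and evaluates $\text{SNR}_{\max} = \mathbf{s}_0^T\mathbf{R}_v^{-1}\mathbf{s}_0$ from (6.9.12) under the assumption $P_\alpha = 1$. The helper names `solve` and `matched_filter` are illustrative, and the noise variance parameter is taken to be 0.25 as read from the example statement.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def matched_filter(s0, r_v):
    """Return c = R_v^{-1} s0 and SNR_max = s0^T R_v^{-1} s0 (real data, P_alpha = 1)."""
    M = len(s0)
    R = [[r_v(abs(i - j)) for j in range(M)] for i in range(M)]
    c = solve(R, s0)
    snr_max = sum(si * ci for si, ci in zip(s0, c))
    return c, snr_max

a, M, var_v = 0.6, 8, 0.25
s0 = [a ** n for n in range(M)]
results = {}
for rho in (0.1, -0.8):
    # r_v(l) = var_v * rho^|l| / (1 - rho^2)
    r_v = lambda l, rho=rho: var_v * rho ** abs(l) / (1 - rho ** 2)
    results[rho] = matched_filter(s0, r_v)

if __name__ == "__main__":
    for rho, (c, snr_max) in results.items():
        print(f"rho = {rho:+.1f}  SNR_max = {snr_max:.3f}")
        print("  c =", [round(ci, 3) for ci in c])
```

The filter returned here is unnormalized ($\kappa = 1$); dividing it by the returned `snr_max` would impose the normalization $\mathbf{c}_o^H\mathbf{s}_0 = 1$ used in the text.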
FIGURE 6.35
Signal and impulse responses of the optimum matched filter that
maximizes the SNR in the presence of additive colored noise.


6.9.2 Random Signal in Noise 


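In the case of a random signal with known correlation matrix $\mathbf{R}_s$, the SNR is


$$\text{SNR}(\mathbf{c}) = \frac{\mathbf{c}^H\mathbf{R}_s\mathbf{c}}{\mathbf{c}^H\mathbf{R}_v\mathbf{c}} \tag{6.9.14}$$

that is, the ratio of two quadratic forms. We again distinguish two cases.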


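White noise case. If the correlation matrix of the noise is given by $\mathbf{R}_v = P_v\mathbf{I}$, we have


$$\text{SNR}(\mathbf{c}) = \frac{1}{P_v}\frac{\mathbf{c}^H\mathbf{R}_s\mathbf{c}}{\mathbf{c}^H\mathbf{c}} \tag{6.9.15}$$

which has the form of Rayleigh’s quotient (Strang 1980; Leon 1998). By using the innovations
transformation $\tilde{\mathbf{c}} = \mathbf{Q}^H\mathbf{c}$, where the unitary matrix $\mathbf{Q}$ is obtained from the
eigendecomposition $\mathbf{R}_s = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^H$, the SNR can be expressed as

$$\text{SNR}(\mathbf{c}) = \frac{1}{P_v}\frac{\tilde{\mathbf{c}}^H\mathbf{\Lambda}\tilde{\mathbf{c}}}{\tilde{\mathbf{c}}^H\tilde{\mathbf{c}}} = \frac{1}{P_v}\frac{\lambda_1|\tilde{c}_1|^2 + \cdots + \lambda_M|\tilde{c}_M|^2}{|\tilde{c}_1|^2 + \cdots + |\tilde{c}_M|^2} \tag{6.9.16}$$

where $0 \le \lambda_1 \le \cdots \le \lambda_M$ are the eigenvalues of the signal correlation matrix. The SNR is
maximized if we choose $\tilde{c}_M = 1$ and $\tilde{c}_1 = \cdots = \tilde{c}_{M-1} = 0$ and is minimized if we choose
$\tilde{c}_1 = 1$ and $\tilde{c}_2 = \cdots = \tilde{c}_M = 0$. Therefore, for any positive definite matrix $\mathbf{R}_s$, we have

$$\lambda_{\min} \le \frac{\mathbf{c}^H\mathbf{R}_s\mathbf{c}}{\mathbf{c}^H\mathbf{c}} \le \lambda_{\max} \tag{6.9.17}$$


which is known as Rayleigh’s quotient (Strang 1980). This implies that the optimum filter
$\mathbf{c} = \mathbf{Q}\tilde{\mathbf{c}}$ is the eigenvector corresponding to the maximum eigenvalue of $\mathbf{R}_s$, that is,

$$\mathbf{c}_o = \mathbf{q}_{\max} \tag{6.9.18}$$

and provides a maximum SNR

$$\text{SNR}_{\max} = \frac{\lambda_{\max}}{P_v} \tag{6.9.19}$$

where $\lambda_{\max} = \lambda_M$. The obtained optimum filter is sometimes known as an eigenfilter
(Makhoul 1981). The following example provides a geometric interpretation of these results
for a second-order filter.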


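EXAMPLE 6.9.2. Suppose that the signal correlation matrix $\mathbf{R}_s$ is given by (see Example 3.5.1)


$$\mathbf{R}_s = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}\begin{bmatrix} 1-\rho & 0 \\ 0 & 1+\rho \end{bmatrix}\left(\frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}\right)^H = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^H$$


where $\rho = 0.81$. To obtain a geometric interpretation, we fix $\mathbf{c}^H\mathbf{c} = 1$ and try to maximize
the numerator $\mathbf{c}^H\mathbf{R}_s\mathbf{c} > 0$ (we assume that $\mathbf{R}_s$ is positive definite). The relation $c_1^2 + c_2^2 = 1$
represents a circle in the $(c_1, c_2)$ plane. The plot can be easily obtained by using the parametric
description $c_1 = \cos\phi$ and $c_2 = \sin\phi$. To obtain the plot of $\mathbf{c}^H\mathbf{R}_s\mathbf{c} = 1$, we note that


$$\mathbf{c}^H\mathbf{R}_s\mathbf{c} = \mathbf{c}^H\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^H\mathbf{c} = \tilde{\mathbf{c}}^H\mathbf{\Lambda}\tilde{\mathbf{c}} = \lambda_1\tilde{c}_1^2 + \lambda_2\tilde{c}_2^2 = 1$$


where $\tilde{\mathbf{c}} = \mathbf{Q}^H\mathbf{c}$. To plot $\lambda_1\tilde{c}_1^2 + \lambda_2\tilde{c}_2^2 = 1$, we use the parametric description $\tilde{c}_1 = \cos\phi/\sqrt{\lambda_1}$
and $\tilde{c}_2 = \sin\phi/\sqrt{\lambda_2}$. The result is an ellipse in the $(\tilde{c}_1, \tilde{c}_2)$ plane. For $\tilde{c}_2 = 0$ we have $\tilde{c}_1 =
1/\sqrt{\lambda_1}$, and for $\tilde{c}_1 = 0$ we have $\tilde{c}_2 = 1/\sqrt{\lambda_2}$. Since $\lambda_1 < \lambda_2$, $2/\sqrt{\lambda_1}$ provides the length of the
major axis determined by the eigenvector $\mathbf{q}_1 = [1\ {-1}]^T/\sqrt{2}$. Similarly, $2/\sqrt{\lambda_2}$ provides the
length of the minor axis determined by the eigenvector $\mathbf{q}_2 = [1\ 1]^T/\sqrt{2}$. The coordinates of the
ellipse in the $(c_1, c_2)$ plane are obtained by the rotation transformation $\mathbf{c} = \mathbf{Q}\tilde{\mathbf{c}}$. The resulting
circle and ellipse are shown in Figure 6.36. The maximum value of $\mathbf{c}^H\mathbf{R}_s\mathbf{c} = \lambda_1\tilde{c}_1^2 + \lambda_2\tilde{c}_2^2$ on the
circle $\tilde{c}_1^2 + \tilde{c}_2^2 = 1$ is obtained for $\tilde{c}_1 = 0$ and $\tilde{c}_2 = 1$, that is, at the endpoint of eigenvector $\mathbf{q}_2$,
and is equal to the largest eigenvalue $\lambda_2$. Similarly, the minimum is $\lambda_1$ and is obtained at the tip
of eigenvector $\mathbf{q}_1$ (see Figure 6.36). Therefore, the optimum filter is $\mathbf{c}_o = \mathbf{q}_2$ and the maximum
SNR is $\lambda_2/P_v$.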
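
The geometric argument of this example can be checked by brute force. The sketch below samples the unit circle $c_1 = \cos\phi$, $c_2 = \sin\phi$ and locates the extremes of the quadratic form $\mathbf{c}^T\mathbf{R}_s\mathbf{c}$ for $\rho = 0.81$. The grid size `N` and the names `quad`, `c_max`, and `c_min` are arbitrary choices made for this illustration.

```python
import math

rho = 0.81
R = [[1.0, rho], [rho, 1.0]]

def quad(c):
    """Quadratic form c^T R c for a real 2-vector c."""
    return sum(c[i] * R[i][j] * c[j] for i in range(2) for j in range(2))

# Sample the unit circle c = [cos(phi), sin(phi)] and track the extremes.
N = 3600
pts = [(math.cos(2 * math.pi * k / N), math.sin(2 * math.pi * k / N)) for k in range(N)]
c_max = max(pts, key=quad)   # expected along q2 = [1, 1]^T / sqrt(2), up to sign
c_min = min(pts, key=quad)   # expected along q1 = [1, -1]^T / sqrt(2), up to sign

if __name__ == "__main__":
    print("max of c^T R c:", quad(c_max), "at", c_max)
    print("min of c^T R c:", quad(c_min), "at", c_min)
```

The largest and smallest values found on the circle agree with the eigenvalues $\lambda_2 = 1 + \rho$ and $\lambda_1 = 1 - \rho$, and they occur along $\mathbf{q}_2$ and $\mathbf{q}_1$, respectively.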


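Colored noise case. Using the Cholesky decomposition $\mathbf{R}_v = \mathbf{L}_v\mathbf{L}_v^H$ of the noise
correlation matrix, we process the observed signal with the transformation $\mathbf{L}_v^{-1}$; that is, we
obtain


$$\mathbf{x}_v(n) \triangleq \mathbf{L}_v^{-1}\mathbf{x}(n) = \mathbf{L}_v^{-1}\mathbf{s}(n) + \mathbf{L}_v^{-1}\mathbf{v}(n) = \mathbf{s}_v(n) + \tilde{\mathbf{v}}(n) \tag{6.9.20}$$

where $\tilde{\mathbf{v}}(n)$ is white noise with $E\{\tilde{\mathbf{v}}(n)\tilde{\mathbf{v}}^H(n)\} = \mathbf{I}$ and $E\{\mathbf{s}_v(n)\mathbf{s}_v^H(n)\} = \mathbf{L}_v^{-1}\mathbf{R}_s\mathbf{L}_v^{-H}$.
Therefore, the optimum matched filter is determined by the eigenvector corresponding
to the maximum eigenvalue of matrix $\mathbf{L}_v^{-1}\mathbf{R}_s\mathbf{L}_v^{-H}$, that is, the correlation matrix of the
transformed signal $\mathbf{s}_v(n)$.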
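
A minimal second-order sketch of this whitening approach is given below, restricted to real $2 \times 2$ matrices so that the Cholesky factor and the eigendecomposition can be written in closed form with the standard library. The matrices `Rs` and `Rv` are hypothetical values chosen for illustration. The final step, mapping the eigenvector $\tilde{\mathbf{c}}$ of the whitened problem back to a filter $\mathbf{c} = \mathbf{L}_v^{-H}\tilde{\mathbf{c}}$ that acts on the original data, is an added assumption that follows from $\tilde{\mathbf{c}}^H\mathbf{x}_v(n) = (\mathbf{L}_v^{-H}\tilde{\mathbf{c}})^H\mathbf{x}(n)$.

```python
import math

def chol2(R):
    """Lower-triangular Cholesky factor of a real symmetric 2x2 matrix."""
    l11 = math.sqrt(R[0][0])
    l21 = R[1][0] / l11
    l22 = math.sqrt(R[1][1] - l21 * l21)
    return l11, l21, l22

def max_eig_sym2(A):
    """Largest eigenvalue and unit eigenvector of a real symmetric 2x2 matrix."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    lam = 0.5 * (a + d) + math.hypot(0.5 * (a - d), b)
    if abs(b) > 1e-15:
        v = (b, lam - a)          # first row of (A - lam I) v = 0
    else:
        v = (1.0, 0.0) if a >= d else (0.0, 1.0)
    n = math.hypot(v[0], v[1])
    return lam, (v[0] / n, v[1] / n)

def snr(c, Rs, Rv):
    """Ratio of quadratic forms (6.9.14) for a real 2-vector c."""
    q = lambda R: sum(c[i] * R[i][j] * c[j] for i in range(2) for j in range(2))
    return q(Rs) / q(Rv)

Rs = [[1.0, 0.81], [0.81, 1.0]]    # hypothetical signal correlation matrix
Rv = [[1.0, -0.5], [-0.5, 1.0]]    # hypothetical colored-noise correlation matrix

l11, l21, l22 = chol2(Rv)
# Inverse of L = [[l11, 0], [l21, l22]]
Li = [[1 / l11, 0.0], [-l21 / (l11 * l22), 1 / l22]]
# A = L^{-1} Rs L^{-T}: correlation matrix of the whitened signal
T = [[sum(Li[i][k] * Rs[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
A = [[sum(T[i][k] * Li[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
lam, ct = max_eig_sym2(A)
# Map back to a filter acting on the unwhitened data: c = L^{-T} c_tilde
c = [Li[0][0] * ct[0] + Li[1][0] * ct[1], Li[0][1] * ct[0] + Li[1][1] * ct[1]]

if __name__ == "__main__":
    print("maximum SNR:", lam)
    print("filter c:", c, " SNR(c):", snr(c, Rs, Rv))
```

For these particular matrices both $\mathbf{R}_s$ and $\mathbf{R}_v$ share the eigenvectors $[1\ 1]^T$ and $[1\ {-1}]^T$, so the result can be verified by hand: the maximum SNR is $1.81/0.5 = 3.62$ and the filter is proportional to $[1\ 1]^T$.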


FIGURE 6.36
Geometric interpretation of the optimization process for the
derivation of the optimum eigenfilter using isopower contours for
$\lambda_1 < \lambda_2$.


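The problem can also be solved by using the simultaneous diagonalization of the signal
and noise correlation matrices $\mathbf{R}_s$ and $\mathbf{R}_v$, respectively. Starting with the decomposition
$\mathbf{R}_v = \mathbf{Q}_v\mathbf{\Lambda}_v\mathbf{Q}_v^H$, we compute the isotropic transformation


$$\begin{aligned}
\mathbf{x}_v(n) &\triangleq \mathbf{\Lambda}_v^{-1/2}\mathbf{Q}_v^H\mathbf{x}(n) \\
&= \mathbf{\Lambda}_v^{-1/2}\mathbf{Q}_v^H\mathbf{s}(n) + \mathbf{\Lambda}_v^{-1/2}\mathbf{Q}_v^H\mathbf{v}(n) \triangleq \tilde{\mathbf{s}}(n) + \tilde{\mathbf{v}}(n)
\end{aligned} \tag{6.9.21}$$


where $E\{\tilde{\mathbf{v}}(n)\tilde{\mathbf{v}}^H(n)\} = \mathbf{I}$ and $E\{\tilde{\mathbf{s}}(n)\tilde{\mathbf{s}}^H(n)\} = \mathbf{\Lambda}_v^{-1/2}\mathbf{Q}_v^H\mathbf{R}_s\mathbf{Q}_v\mathbf{\Lambda}_v^{-1/2} \triangleq \mathbf{R}_{\tilde{s}}$. Since the
noise vector is white, the optimum matched filter is determined by the eigenvector
corresponding to the maximum eigenvalue of matrix $\mathbf{R}_{\tilde{s}}$.

Finally, if $\mathbf{R}_{\tilde{s}} = \mathbf{Q}_{\tilde{s}}\mathbf{\Lambda}_{\tilde{s}}\mathbf{Q}_{\tilde{s}}^H$, the transformation

$$\mathbf{x}_{vs}(n) \triangleq \mathbf{Q}_{\tilde{s}}^H\mathbf{x}_v(n) = \mathbf{Q}_{\tilde{s}}^H\tilde{\mathbf{s}}(n) + \mathbf{Q}_{\tilde{s}}^H\tilde{\mathbf{v}}(n) \triangleq \bar{\mathbf{s}}(n) + \bar{\mathbf{v}}(n) \tag{6.9.22}$$

results in new signal and noise vectors with correlation matrices

$$E\{\bar{\mathbf{s}}(n)\bar{\mathbf{s}}^H(n)\} = \mathbf{Q}_{\tilde{s}}^H\mathbf{R}_{\tilde{s}}\mathbf{Q}_{\tilde{s}} = \mathbf{\Lambda}_{\tilde{s}} \tag{6.9.23}$$

$$E\{\bar{\mathbf{v}}(n)\bar{\mathbf{v}}^H(n)\} = \mathbf{Q}_{\tilde{s}}^H\mathbf{I}\mathbf{Q}_{\tilde{s}} = \mathbf{I} \tag{6.9.24}$$

Therefore, the transformation matrix

$$\mathbf{Q} \triangleq \mathbf{Q}_{\tilde{s}}^H\mathbf{\Lambda}_v^{-1/2}\mathbf{Q}_v^H \tag{6.9.25}$$


diagonalizes matrices $\mathbf{R}_s$ and $\mathbf{R}_v$ simultaneously (Fukunaga 1990).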
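The maximization of (6.9.14) can also be obtained by whitening the signal, that is, by
using the Cholesky decomposition $\mathbf{R}_s = \mathbf{L}_s\mathbf{L}_s^H$ of the signal correlation matrix. Indeed,
using the transformation $\tilde{\mathbf{c}} = \mathbf{L}_s^H\mathbf{c}$, we have

$$\text{SNR}(\mathbf{c}) = \frac{\mathbf{c}^H\mathbf{R}_s\mathbf{c}}{\mathbf{c}^H\mathbf{R}_v\mathbf{c}} = \frac{\mathbf{c}^H\mathbf{L}_s\mathbf{L}_s^H\mathbf{c}}{\mathbf{c}^H\mathbf{L}_s\mathbf{L}_s^{-1}\mathbf{R}_v\mathbf{L}_s^{-H}\mathbf{L}_s^H\mathbf{c}} = \frac{\tilde{\mathbf{c}}^H\tilde{\mathbf{c}}}{\tilde{\mathbf{c}}^H\mathbf{L}_s^{-1}\mathbf{R}_v\mathbf{L}_s^{-H}\tilde{\mathbf{c}}} \tag{6.9.26}$$
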
EXAMPLE 6.9.3. The basic problem in many radar detection systems is the separation of a useful
signal from colored noise or interference background. In several cases the signal is a point target
(i.e., it can be modeled as a unit impulse) or is random with a flat PSD, that is, $\mathbf{R}_s = P_a\mathbf{I}$.
Suppose that the background is colored with correlation $r_v(i, j) = \rho^{(i-j)^2}$, $1 \le i, j \le M$,
which leads to a Toeplitz correlation matrix $\mathbf{R}_v$. We determine and compare three filters for
interference rejection. The first is a matched filter that maximizes the SNR


$$\text{SNR}(\mathbf{c}) = \frac{P_a\mathbf{c}^H\mathbf{c}}{\mathbf{c}^H\mathbf{R}_v\mathbf{c}} \tag{6.9.27}$$


by setting $\mathbf{c}$ equal to the eigenvector corresponding to the minimum eigenvalue of $\mathbf{R}_v$. The
second approach is based on the method of linear prediction. Indeed, if we assume that the
interference $v_1(n)$ is much stronger than the useful signal $s_1(n)$, we can obtain an estimate $\hat{v}_1(n)$
of $v_1(n)$ using the observed samples $\{x_k(n)\}_2^M$ and then subtract $\hat{v}_1(n)$ from $x_1(n)$ to cancel the
interference. The Wiener filter with desired response $y(n) = v_1(n)$ and input data $\{x_k(n)\}_2^M$ is

$$\hat{y}(n) = -\sum_{k=1}^{M-1} a_k^* x_{k+1}(n) \triangleq -\mathbf{a}^H\tilde{\mathbf{x}}(n)$$


and is specified by the normal equations

$$\mathbf{R}_x\mathbf{a} = -\mathbf{d}$$

and the MMSE

$$P^f = E\{|v_1|^2\} + \mathbf{d}^H\mathbf{a}$$

where

$$\langle\mathbf{R}_x\rangle_{ij} = E\{x_{i+1}x_{j+1}^*\} \approx E\{v_{i+1}(n)v_{j+1}^*(n)\}$$

and

$$d_i = E\{v_1 x_{i+1}^*(n)\} \approx E\{v_1(n)v_{i+1}^*(n)\}$$


because the interference is assumed much stronger than the signal. Using the last four equations,
we obtain

$$\mathbf{R}_v\begin{bmatrix} 1 \\ \mathbf{a} \end{bmatrix} = \begin{bmatrix} P^f \\ \mathbf{0} \end{bmatrix} \tag{6.9.28}$$


which corresponds to the forward linear prediction error (LPE) filter discussed in Section 6.5.2.
Finally, for the sake of comparison, we consider the binomial filters $H_M(z) = (1 - z^{-1})^M$ that
are widely used in radar systems for the elimination of stationary (i.e., nonmoving) clutter. Figure
6.37 shows the magnitude response of the three filters for $\rho = 0.9$ and $M = 4$. We emphasize that
the FLP method is suboptimum compared to matched filtering. However, because the frequency
response of the FLP filter does not have the deep zero notch, we use it if we do not want to lose
useful signals in that band (Chiuppesi et al. 1980).


FIGURE 6.37
Comparison of frequency responses of matched filter, prediction error filter, and
binomial interference rejection filter.
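
The comparison between the matched filter and the forward LPE filter can be explored numerically. The sketch below assumes the clutter correlation $r_v(i, j) = \rho^{(i-j)^2}$ and $P_a = 1$, obtains the minimum-eigenvalue eigenvector of $\mathbf{R}_v$ by inverse iteration (a substitute for a library eigensolver, chosen here only to keep the code dependent on the standard library alone), and obtains the LPE filter from the normal equations. All function names are illustrative.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def clutter_matrix(M, rho):
    """Toeplitz R_v with entries rho**((i - j)**2) (assumed clutter model)."""
    return [[rho ** ((i - j) ** 2) for j in range(M)] for i in range(M)]

def snr(c, R):
    """SNR (6.9.27) with P_a = 1 for a real filter c."""
    num = sum(x * x for x in c)
    den = sum(c[i] * R[i][j] * c[j] for i in range(len(c)) for j in range(len(c)))
    return num / den

def matched(R, iters=200):
    """Minimum-eigenvalue eigenvector of R by inverse iteration."""
    v = [(-1.0) ** i * (i + 1) for i in range(len(R))]   # arbitrary starting vector
    for _ in range(iters):
        v = solve(R, v)
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

def flp(R):
    """Forward LPE filter [1, a_1, ..., a_{M-1}] from R_{M-1} a = -d."""
    M = len(R)
    sub = [row[:M - 1] for row in R[:M - 1]]
    d = [R[0][k] for k in range(1, M)]
    return [1.0] + solve(sub, [-x for x in d])

if __name__ == "__main__":
    for M in (2, 3, 4):
        R = clutter_matrix(M, 0.9)
        print(f"M = {M}: matched SNR = {snr(matched(R), R):.2f}, "
              f"FLP SNR = {snr(flp(R), R):.2f}")
```

Because the matched filter maximizes (6.9.27) over all filters, its SNR is never smaller than that of the LPE filter for the same $M$ and $\rho$.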


6.10 SUMMARY 


In this chapter, we discussed the theory and application of optimum linear filters designed by 
minimizing the MSE criterion of performance. Our goal was to explain the characteristics of 
each criterion, emphasize when its use made sense, and illustrate its meaning in the context 
of practical applications. 

We started with linear processors that formed an estimate of the desired response by 
combining a set of different signals (data) and showed that the parameters of the optimum 
processor can be obtained by solving a linear system of equations (normal equations). The 
matrix and the right-hand side vector of the normal equations are completely specified 
by the second-order moments of the input data and the desired response. Next, we used 
the developed theory to design optimum FIR filters, linear signal estimators, and linear 
predictors. 

We emphasized the case of stationary stochastic processes and showed that the resulting 
optimum estimators are time-invariant. Therefore, we need to design only one optimum filter 
that can be used to process all realizations of the underlying stochastic processes. Although 
another filter may perform better for some realizations, that is, the estimated MSE is smaller 
than the MMSE, on average (i.e., when we consider all possible realizations), the optimum 
filter is the best. 

We showed that the performance of optimum linear filters improves as we increase the 
number of filter coefficients. Therefore, the noncausal IIR filter provides the best possible 
performance and can be used as a yardstick to assess other filters. Because IIR filters
involve an infinite number of parameters, their design involves linear equations with an 
infinite number of unknowns. For stationary processes, these equations take the form of a 
convolution equation that can be solved using z-transform techniques. If we use a pole-zero 
structure, the normal equations become nonlinear and the design of the optimum filter is 
complicated by the presence of multiple local minima. 

Then we discussed the design of optimum filters for inverse system modeling and blind 
deconvolution, and we provided a detailed discussion of their use in the important practical 
application of channel equalization for data transmission systems. 

Finally, we provided a concise introduction to the design of optimum matched filters 
and eigenfilters that maximize the output SNR and find applications for the detection of 
signals in digital communication and radar systems. 


PROBLEMS 


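6.1 Let $\mathbf{x}$ be a random vector with mean $E\{\mathbf{x}\}$. Show that the linear MMSE estimate $\hat{y}$ of a random
variable $y$ using the data vector $\mathbf{x}$ is given by $\hat{y} = y_o + \mathbf{c}^H\mathbf{x}$, where $y_o = E\{y\} -
\mathbf{c}^H E\{\mathbf{x}\}$, $\mathbf{c} = \mathbf{R}^{-1}\mathbf{d}$, $\mathbf{R} = E\{\mathbf{x}\mathbf{x}^H\}$, and $\mathbf{d} = E\{\mathbf{x}y^*\}$.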


6.2 Consider an optimum FIR filter specified by the input correlation matrix $\mathbf{R} = \text{Toeplitz}\{1, \tfrac{1}{4}\}$
and cross-correlation vector $\mathbf{d} = [1\ \tfrac{1}{2}]^T$.


(a) Determine the optimum impulse response $\mathbf{c}_o$ and the MMSE $P_o$.
(b) Express $\mathbf{c}_o$ and $P_o$ in terms of the eigenvalues and eigenvectors of $\mathbf{R}$.


6.3 Repeat Problem 6.2 for a third-order optimum FIR filter. 
6.4 A process $y(n)$ with the autocorrelation $r_y(l) = a^{|l|}$, $-1 < a < 1$, is corrupted by additive,
uncorrelated white noise $v(n)$ with variance $\sigma_v^2$. To reduce the noise in the observed process
$x(n) = y(n) + v(n)$, we use a first-order Wiener filter.


(a) Express the coefficients $c_{o,1}$ and $c_{o,2}$ and the MMSE $P_o$ in terms of parameters $a$ and $\sigma_v^2$.

(b) Compute and plot the PSD of $x(n)$ and the magnitude response $|C_o(e^{j\omega})|$ of the filter when
$\sigma_v^2 = 2$, for both $a = 0.8$ and $a = -0.8$, and compare the results.

(c) Compute and plot the processing gain of the filter for $a = -0.9, -0.8, -0.7, \ldots, 0.9$ as a
function of $a$ and comment on the results.


6.5 Consider the harmonic process $y(n)$ and its noise observation $x(n)$ given in Example 6.4.1.


(a) Show that $r_y(l) = \tfrac{1}{2}A^2\cos\omega_0 l$.

(b) Write a MATLAB function h = opt_fir(A,f0,var_v,M) to design an $M$th-order optimum
FIR filter impulse response $h(n)$. Use the toeplitz function from MATLAB to
generate correlation matrix $\mathbf{R}$.

(c) Determine the impulse response of a 20th-order optimum FIR filter for $A = 0.5$, $f_0 = 0.05$,
and $\sigma_v^2 = 0.5$.

(d) Using MATLAB, determine and plot the magnitude response of the above-designed filter,
and verify your results with those given in Example 6.4.1.


6.6 Consider a “desired” signal $s(n)$ generated by the process $s(n) = -0.8w(n-1) + w(n)$, where
$w(n) \sim \text{WN}(0, \sigma_w^2)$. This signal is passed through the causal system $H(z) = 1 - 0.9z^{-1}$ whose
output $y(n)$ is corrupted by additive white noise $v(n) \sim \text{WN}(0, \sigma_v^2)$. The processes $w(n)$ and
$v(n)$ are uncorrelated with $\sigma_w^2 = 0.3$ and $\sigma_v^2 = 0.1$.


(a) Design a second-order optimum FIR filter that estimates $s(n)$ from the signal $x(n) =
y(n) + v(n)$ and determine $\mathbf{c}_o$ and $P_o$.

(b) Plot the error performance surface, and verify that it is quadratic and that the optimum filter
points to its minimum.

(c) Repeat part (a) for a third-order filter, and see whether there is any improvement.


6.7 Repeat Problem 6.6, assuming that the desired signal is generated by $s(n) = -0.8s(n-1) + w(n)$.


6.8 Repeat Problem 6.6, assuming that $H(z) = 1$.


6.9 A stationary process $x(n)$ is generated by the difference equation $x(n) = \rho x(n-1) + w(n)$,
where $w(n) \sim \text{WN}(0, \sigma_w^2)$.

(a) Show that the correlation matrix of $x(n)$ is given by

$$\mathbf{R}_x = \frac{\sigma_w^2}{1 - \rho^2}\,\text{Toeplitz}\{1, \rho, \rho^2, \ldots, \rho^{M-1}\}$$

(b) Show that the $M$th-order FLP is given by $a_1^{(M)} = -\rho$, $a_k^{(M)} = 0$ for $k > 1$ and the MMSE
is $P_M^f = \sigma_w^2$.


6.10 Using Parseval’s theorem, show that (6.4.18) can be written as (6.4.21) in the frequency domain. 


6.11 By differentiating (6.4.21) with respect to $H(e^{j\omega})$, derive the frequency response function
$H_o(e^{j\omega})$ of the optimum filter in terms of $R_{yx}(e^{j\omega})$ and $R_x(e^{j\omega})$.


6.12 A conjugate symmetric linear smoother is obtained from (6.5.12) when $M = 2L$ and $i = L$. If
the process $x(n)$ is stationary, then, using $\mathbf{R}\mathbf{J} = \mathbf{J}\mathbf{R}^*$, show that $\mathbf{c} = \mathbf{J}\mathbf{c}^*$.


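6.13 Let $\mathbf{Q}$ and $\mathbf{\Lambda}$ be the matrices from the eigendecomposition of $\mathbf{R}$, that is, $\mathbf{R} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^H$.


(a) Substitute $\mathbf{R}$ into (6.5.20) and (6.5.27) to prove (6.5.43) and (6.5.44).


(b) Generalize the above result for a $j$th-order linear signal estimator $\mathbf{c}^{(j)}(n)$; that is, prove that

$$\mathbf{c}^{(j)}(n) = P_o^{(j)}(n)\sum_{i=1}^{M+1}\frac{1}{\lambda_i}\mathbf{q}_i q_{i,j}^*$$

6.14 Let $\tilde{\mathbf{R}}(n)$ be the inverse of the correlation matrix $\mathbf{R}(n)$ given in (6.5.11).


(a) Using (6.5.12), show that the diagonal elements of $\tilde{\mathbf{R}}(n)$ are given by

$$\langle\tilde{\mathbf{R}}(n)\rangle_{i,i} = \frac{1}{P_o^{(i)}(n)} \qquad 1 \le i \le M + 1$$

(b) Furthermore, show that

$$\mathbf{c}_o^{(i)}(n) = \frac{\tilde{\mathbf{r}}_i(n)}{\langle\tilde{\mathbf{R}}(n)\rangle_{i,i}} \qquad 1 \le i \le M + 1$$


where $\tilde{\mathbf{r}}_i(n)$ is the $i$th column of $\tilde{\mathbf{R}}(n)$.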


6.15 The first five samples of the autocorrelation sequence of a signal $x(n)$ are $r(0) = 1$, $r(1) =
0.8$, $r(2) = 0.6$, $r(3) = 0.4$, and $r(4) = 0.3$. Compute the FLP, the BLP, the optimum symmetric
smoother, and the corresponding MMSE (a) by using the normal equations method and (b) by
using the inverse of the normal equations matrix.


6.16 For the symmetric, Toeplitz autocorrelation matrix $\mathbf{R} = \text{Toeplitz}\{r(0), r(1), r(2)\} = r(0) \times
\text{Toeplitz}\{1, \rho_1, \rho_2\}$ with $\mathbf{R} = \mathbf{L}\mathbf{D}\mathbf{L}^H$ and $\mathbf{D} = \text{diag}\{\xi_1, \xi_2, \xi_3\}$, the following conditions are
equivalent:


• $\mathbf{R}$ is positive definite.
• $\xi_i > 0$ for $1 \le i \le 3$.
• $|k_i| < 1$ for $1 \le i \le 3$.


Determine the values of $\rho_1$ and $\rho_2$ for which $\mathbf{R}$ is positive definite, and plot the corresponding
area in the $(\rho_1, \rho_2)$ plane.


6.17 Prove the first equation in (6.5.45) by rearranging the FLP normal equations in terms of the
unknowns $P_o^f(n), a_1(n), \ldots, a_M(n)$ and then solve for $P_o^f(n)$, using Cramer’s rule. Repeat the
procedure for the second equation.


6.18 Consider the signal $x(n) = y(n) + v(n)$, where $y(n)$ is a useful random signal corrupted by
noise $v(n)$. The processes $y(n)$ and $v(n)$ are uncorrelated with PSDs

$$R_y(e^{j\omega}) = \begin{cases} 1 & 0 \le |\omega| \le \dfrac{\pi}{2} \\ 0 & \dfrac{\pi}{2} < |\omega| \le \pi \end{cases}$$

and

$$R_v(e^{j\omega}) = \begin{cases} 1 & \dfrac{\pi}{4} \le |\omega| \le \dfrac{\pi}{2} \\ 0 & 0 \le |\omega| < \dfrac{\pi}{4} \text{ and } \dfrac{\pi}{2} < |\omega| \le \pi \end{cases}$$


respectively. (a) Determine the optimum IIR filter and find the MMSE. (b) Determine a third-order
optimum FIR filter and the corresponding MMSE. (c) Determine the noncausal optimum
FIR filter defined by


$$\hat{y}(n) = h(-1)x(n+1) + h(0)x(n) + h(1)x(n-1)$$

and the corresponding MMSE.

6.19 Consider the ARMA(1, 1) process $x(n) = 0.8x(n-1) + w(n) + 0.5w(n-1)$, where $w(n) \sim
\text{WGN}(0, 1)$. (a) Determine the coefficients and the MMSE of (1) the one-step ahead FLP $\hat{x}(n) =
a_1x(n-1) + a_2x(n-2)$ and (2) the two-step ahead FLP $\hat{x}(n+1) = a_1x(n-1) + a_2x(n-2)$.
(b) Check if the obtained prediction error filters are minimum-phase, and explain your findings.

6.20 Consider a random signal $x(n) = s(n) + v(n)$, where $v(n) \sim \text{WGN}(0, 1)$ and $s(n)$ is the AR(1)
process $s(n) = 0.9s(n-1) + w(n)$, where $w(n) \sim \text{WGN}(0, 0.64)$. The signals $s(n)$ and $v(n)$
are uncorrelated. (a) Determine and plot the autocorrelation $r_s(l)$ and the PSD $R_s(e^{j\omega})$ of $s(n)$.
(b) Design a second-order optimum FIR filter to estimate $s(n)$ from $x(n)$. What is the MMSE?
(c) Design an optimum IIR filter to estimate $s(n)$ from $x(n)$. What is the MMSE?


6.21 A useful signal $s(n)$ with PSD $R_s(z) = [(1 - 0.9z^{-1})(1 - 0.9z)]^{-1}$ is corrupted by additive
uncorrelated noise $v(n) \sim \text{WN}(0, \sigma_v^2)$. (a) The resulting signal $x(n) = s(n) + v(n)$ is passed
through a causal filter with system function $H(z) = (1 - 0.8z^{-1})^{-1}$. Determine (1) the SNR
at the input, (2) the SNR at the output, and (3) the processing gain, that is, the improvement
in SNR. (b) Determine the causal optimum filter and compare its performance with that of the
filter in (a).


6.22 A useful signal $s(n)$ with PSD $R_s(z) = 0.36[(1 - 0.8z^{-1})(1 - 0.8z)]^{-1}$ is corrupted by additive
uncorrelated noise $v(n) \sim \text{WN}(0, 1)$. Determine the optimum noncausal and causal IIR filters,
and compare their performance by examining the MMSE and their magnitude response. Hint:
Plot the magnitude responses on the same graph with the PSDs of signal and noise.


6.23 Consider a process with PSD $R_x(z) = \sigma^2H_x(z)H_x(z^{-1})$. Determine the $D$-step ahead linear
predictor, and show that the MMSE is given by $P^{[D]} = \sigma^2\sum_{n=0}^{D-1}h_x^2(n)$. Check your results
by using the PSD $R_x(z) = (1 - a^2)[(1 - az^{-1})(1 - az)]^{-1}$.


6.24 Let $x(n) = s(n) + v(n)$ with $R_v(z) = 1$, $R_{sv}(z) = 0$, and

$$R_s(z) = \frac{0.75}{(1 - 0.5z^{-1})(1 - 0.5z)}$$

Determine the optimum filters for the estimation of $s(n)$ and $s(n-2)$ from $\{x(k)\}_{-\infty}^{n}$ and the
corresponding MMSEs.


6.25 For the random signal with PSD

$$R_x(z) = \frac{(1 - 0.2z^{-1})(1 - 0.2z)}{(1 - 0.9z^{-1})(1 - 0.9z)}$$

determine the optimum two-step ahead linear predictor and the corresponding MMSE.


6.26 Repeat Problem 6.25 for

$$R_x(z) = \frac{1}{(1 - 0.2z^{-1})(1 - 0.2z)(1 - 0.9z^{-1})(1 - 0.9z)}$$


6.27 Let $x(n) = s(n) + v(n)$ with $v(n) \sim \text{WN}(0, 1)$ and $s(n) = 0.6s(n-1) + w(n)$, where
$w(n) \sim \text{WN}(0, 0.82)$. The processes $s(n)$ and $v(n)$ are uncorrelated. Determine the optimum
filters for the estimation of $s(n)$, $s(n+2)$, and $s(n-2)$ from $\{x(k)\}_{-\infty}^{n}$ and the corresponding
MMSEs.


6.28 Repeat Problem 6.27 for $R_s(z) = [(1 - 0.5z^{-1})(1 - 0.5z)]^{-1}$, $R_v(z) = 5$, and $R_{sv}(z) = 0$.


6.29 Consider the random sequence x(n) generated in Example 6.5.2 
x(n) = w(n) + Sw(n —~1) 


where w(n) is WN(O, 1). Generate K = 100 sample functions {we (ny k =1,...,K of 


n=0’ 
w(n), in order to generate K sample functions {x;, lair k=1,...,K of x(n). 
(a) Use the second-order FLP a; to obtain predictions (at (n)}*_, of x, (n), fork =1,..., K. 
Then determine the average error 
I N 
pi — a 2 ben —ia@)? k=1,...,K 
n= 


and plot it as a function of k. Compare it with pi ; 


(b) Use the second-order BLP bx to obtain predictions ee) rg k=1,...,K of x(n). 
Then determine the average error 


1 N-2 
p> — AL 2, la) Ae k=1,...,K 
n= 


and plot it as a function of k. Compare it with p>. 


(c) Use the second-order symmetric linear smoother c; to obtain smooth estimates {xp (nN 
of x,(n) fork = 1,..., K. Determine the average error 
N-1 


A 


1 P 
Ps a7 Se lxe(n) — FEM)? k= 1... K 
n=1 


and plot it as a function of k. Compare it with PS. 


6.30 Let $x(n) = y(n) + v(n)$ be a wide-sense stationary process. The linear, symmetric smoothing
filter estimator of $y(n)$ is given by

$$\hat{y}(n) = \sum_{k=-L}^{L} c_k^s x(n - k)$$


(a) Determine the normal equations for the optimum MMSE filter.

(b) Show that the smoothing filter $c_k^s$ has linear phase.

(c) Use the Lagrange multiplier method to determine the MMSE $M$th-order estimator $\hat{y}(n) =
\mathbf{c}^H\mathbf{x}(n)$, where $M = 2L + 1$, when the filter vector $\mathbf{c}$ is constrained to be conjugate
symmetric, that is, $\mathbf{c} = \mathbf{J}\mathbf{c}^*$. Compare the results with those obtained in part (a).


6.31 Consider the causal prediction filter discussed in Example 6.6.1. To determine HL} (z), first 
compute the causal part of the z-transform Dai (z)]4. Next compute HL} (z) by using (6.6.21). 


(a) Determine ni?!) (n). 
(b) Using the above nl?! (n), show that 


[D] __ ,_ 5/4)2D 
Per) =1- 34) 
6.32 Consider the causal smoothing filter discussed in Example 6.6.1. 
(a) Using [r}(D]4 = ryw@ + D)u(l), D <0, show that [7,,,, ()] can be put in the form 
yw Ole = 3(S)!+Pud + D) + 324)lu@ —ud + dD) D <0 


(b) Hence, show that [Ry w+ is given by 


5 og? ee 
: D io 
[Ryw@l+ = 542441 +2027) y 2 

5% 1=0 


(c) Finally using (6.6.21), prove (6.6.54). 


6.33 In this problem, we will prove (6.6.57) 


(a) Starting with (6.6.42), show that [Ryw (z)]+ can also be put in the form 


' 3 zP BP sg? 
[Ryw (i+ = + 
yw rE 5 (Sn 1—2z-! 


(b) Now, using (6.6.21), show that 


3 f2Pa- $271) + 2,D-1 
(l= $274) - 227!) 
hence, show that


7D 


: [D] 
lim AQ (z)= 
D>-co * 40 F = S271) — 27-1) 


i 2 Ac (z) 
(c) Finally, show that lim P!?! = Pye. 
D->co 


6.34 Consider the block diagram of a simple communication system shown in Figure 6.38. The
information resides in the signal $s(n)$ produced by exciting the system $H_1(z) = 1/(1 + 0.95z^{-1})$
with the process $w(n) \sim \text{WGN}(0, 0.3)$. The signal $s(n)$ propagates through the channel $H_2(z) =
1/(1 - 0.85z^{-1})$ and is corrupted by the additive noise process $v(n) \sim \text{WGN}(0, 0.1)$, which is
uncorrelated with $w(n)$. (a) Determine a second-order optimum FIR filter ($M = 2$) that estimates
the signal $s(n)$ from the received signal $x(n) = z(n) + v(n)$. What is the corresponding MMSE
$P_o$? (b) Plot the error performance surface and verify that the optimum filter corresponds to the
bottom of the “bowl.” (c) Use a Monte Carlo simulation (100 realizations with a 1000-sample
length each) to verify the theoretically obtained MMSE in part (a). (d) Repeat part (a) for $M = 3$
and check if there is any improvement. Hint: To compute the autocorrelation of $z(n)$, notice that
the output of $H_1(z)H_2(z)$ is an AR(2) process.


FIGURE 6.38
Block diagram of simple communication system used in Problem 6.34.


6.35 Write a program to reproduce the results shown in Figure 6.35 of Example 6.9.1. (a) Produce
plots for $\rho = 0.1, -0.8, 0.8$. (b) Repeat part (a) for $M = 16$. Compare the plots obtained in (a)
and (b) and justify any similarities or differences.


6.36 Write a program to reproduce the plot shown in Figure 6.36 of Example 6.9.2. Repeat for
$\rho = -0.81$ and explain the similarities and differences between the two plots.


6.37 In this problem we study in greater detail the interference rejection filters discussed in Example
6.9.3. (a) Show that the SNRs for the matched filter and FLP filter are given by


$$\begin{array}{ccc}
M & \text{Matched filter} & \text{FLP filter} \\ \hline
2 & \dfrac{1}{1 - \rho} & \dfrac{1 + \rho^2}{1 - \rho^2} \\[2ex]
3 & \dfrac{2}{2 + \rho^4 - \rho^4\sqrt{1 + 8\rho^{-6}}} & \dfrac{1 + \rho^2 + 3\rho^4 + \rho^6}{(\rho^2 - 1)(\rho^4 - 1)}
\end{array}$$


and check the results numerically. (b) Compute and plot the SNRs and compare the performance
of both filters for $M = 2, 3, 4$ and $\rho = 0.6, 0.8, 0.9, 0.95, 0.99$, and $0.995$. For what values
of $\rho$ and $M$ do the two methods give similar results? Explain your conclusions. (c) Plot the
magnitude response of the matched, FLP, and binomial filters for $M = 3$ and $\rho = 0.9$. Why
does the optimum matched filter always have some nulls in its frequency response?


6.38 Determine the matched filter for the deterministic pulse $s(n) = \cos \omega_0 n$ for $0 \le n \le M-1$ and
zero elsewhere when the noise is (a) white with variance $\sigma_v^2$ and (b) colored with autocorrelation
$r_v(l) = \sigma_v^2 \rho^{|l|}/(1-\rho^2)$, $-1 < \rho < 1$. Plot the frequency response of the filter and superimpose
it on the noise PSD, for $\omega_0 = \pi/6$, M = 12, $\sigma_v^2 = 1$, and $\rho = 0.9$. Explain the shape of the
obtained response. (c) Study the effect of the SNR in part (a) by varying the value of $\sigma_v^2$. (d)
Study the effect of the noise correlation in part (c) by varying the value of $\rho$.


6.39 Consider the equalization experiment in Example 6.8.1 with M = 11 and D = 7. (a) Compute
and plot the magnitude response $|H(e^{j\omega})|$ of the channel and $|C_o(e^{j\omega})|$ of the optimum equalizer
for W = 2.9, 3.1, 3.3, and 3.5 and comment upon the results. (b) For the same values of
W, compute the spectral dynamic range $|H(e^{j\omega})|_{\max}/|H(e^{j\omega})|_{\min}$ of the channel and the
eigenvalue spread $\lambda_{\max}/\lambda_{\min}$ of the M x M input correlation matrix. Explain how the variation
in one affects the other.


6.40 In this problem we clarify some of the properties of the MSE equalizer discussed in Example
6.8.1. (a) Compute and plot the MMSE $P_o$ as a function of M, and recommend how to choose
a “reasonable” value. (b) Compute and plot $P_o$ as a function of the delay D for $0 \le D \le 11$.
What is the best value of D? (c) Study the effect of input SNR upon $P_o$ for M = 11 and D = 7
by fixing the signal variance at 1 and varying the noise variance.


6.41 In this problem we formulate the design of optimum linear signal estimators (LSE) using a
constrained optimization framework. To this end we consider the estimator
$e(n) = c_0^* x(n) + \cdots + c_M^* x(n-M) \triangleq \mathbf{c}^H \mathbf{x}(n)$ and we wish to minimize the output power
$E\{|e(n)|^2\} = \mathbf{c}^H \mathbf{R} \mathbf{c}$.
To prevent the trivial solution $\mathbf{c} = \mathbf{0}$ we need to impose some constraint on the filter coefficients
and use Lagrange multipliers to determine the minimum. Let $\mathbf{u}_i$ be an M x 1 vector with
one at the ith position and zeros elsewhere. (a) Show that minimizing $\mathbf{c}^H \mathbf{R} \mathbf{c}$ under the linear
constraint $\mathbf{u}_i^T \mathbf{c} = 1$ provides the following estimators: FLP if i = 0, BLP if i = M, and
linear smoother if $i \ne 0, M$. (b) Determine the appropriate set of constraints for the L-steps
ahead linear predictor, defined by $c_0 = 1$ and $\{c_k = 0\}_1^{L-1}$, and solve the corresponding
constrained optimization problem. Verify your answer by obtaining the normal equations using
the orthogonality principle. (c) Determine the optimum linear estimator by minimizing $\mathbf{c}^H \mathbf{R} \mathbf{c}$
under the quadratic constraints $\mathbf{c}^H \mathbf{c} = 1$ and $\mathbf{c}^H \mathbf{W} \mathbf{c} = 1$ ($\mathbf{W}$ is a positive definite matrix) which
impose a constraint on the length of the filter vector.


CHAPTER 7

Algorithms and Structures for Optimum Linear Filters


The design and application of optimum filters involves (1) the solution of the normal
equations to determine the optimum set of coefficients, (2) the evaluation of the cost function 
to determine whether the obtained parameters satisfy the design requirements, and (3) the 
implementation of the optimum filter, that is, the computation of its output that provides 
the estimate of the desired response. 

The normal equations can be solved by using any general-purpose routine for linear 
simultaneous equations. However, there are several important reasons to study the normal 
equations in greater detail in order to develop efficient, special-purpose algorithms for their 
solution. First, the throughput of several real-time applications can only be served with serial 
or parallel algorithms that are obtained by exploiting the special structure (e.g., Toeplitz) of 
the correlation matrix. Second, sometimes we can develop order-recursive algorithms that 
help us to choose the correct filter order or to stop the algorithm before the manifestation 
of numerical problems. Third, some algorithms lead to intermediate sets of parameters that 
have physical meaning, provide easy tests for important properties (e.g., minimum phase), 
or are useful in special applications (e.g., data compression). Finally, sometimes there is a 
link between the algorithm for the solution of the normal equations and the structure for 
the implementation of the optimum filter. 

In this chapter, we present different algorithms for the solution of the normal equations, 
the computation of the minimum mean square error (MMSE), and the implementation of 
the optimum filter. We start in Section 7.1 with a discussion of some results from matrix 
algebra that are useful for the development of order-recursive algorithms and introduce an 
algorithm for the order-recursive computation of the $LDL^H$ decomposition, the MMSE,
and the optimum estimate in the general case. In Section 7.2, we present some interesting 
interpretations for the various introduced algorithmic quantities and procedures that provide 
additional insight into the optimum filtering problem. 

The only assumption we have made so far is that we know the required second-order 
statistics; hence, the results apply to any linear estimation problem: array processing, fil- 
tering, and prediction of nonstationary or stationary processes. In the sequel, we impose 
additional constraints on the input data vector and show how to exploit them in order to sim- 
plify the general algorithms and structures or specify new ones. In Section 7.3, we explore 
the shift invariance of the input data vector to develop a time-varying lattice-ladder structure 
for the optimum filter. However, to derive an order-recursive algorithm for the computation 
of either the direct or lattice-ladder structure parameters of the optimum time-varying filter, 
we need an analytical description of the changing second-order statistics of the nonstation- 
ary input process. Recall that in the simplest case of stationary processes, the correlation 
matrix is constant and Toeplitz. As a result, the optimum FIR filters and predictors are 
time-invariant, and their direct or lattice-ladder structure parameters can be computed (only 
once) using efficient, order-recursive algorithms due to Levinson and Durbin (Section 7.4) 
or Schur (Section 7.6). Section 7.5 provides a derivation of the lattice-ladder structures for
optimum filtering and prediction, their structural and statistical properties, and algorithms
for transformations between the various sets of parameters. Section 7.7 deals with efficient, 
order-recursive algorithms for the triangularization and inversion of Toeplitz matrices. 

The chapter concludes with Section 7.8 which provides a concise introduction to the 
Kalman filtering algorithm. The Kalman filter provides a recursive solution to the minimum 
MSE filtering problem when the input stochastic process is described by a known state space 
model. This is possible because the state space model leads to a recursive formula for the 
updating of the required second-order moments. 


7.1 FUNDAMENTALS OF ORDER-RECURSIVE ALGORITHMS 


In Section 6.3, we introduced a method to solve the normal equations and compute the 
MMSE using the $LDL^H$ decomposition. The optimum estimate is computed as a sum of
products using a linear combiner supplied with the optimum coefficients and the input data. 
The key characteristic of this approach is that the order of the estimator should be fixed 
initially, and in case we choose a different order, we have to repeat all the computations. 
Such computational methods are known as fixed-order algorithms. 

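When the order of the estimator becomes a design variable, we need to modify our
notation to take this into account. For example, the mth-order estimator $\mathbf{c}_m(n)$ is obtained
by minimizing $E\{|e_m(n)|^2\}$, where

$$e_m(n) = y(n) - \hat{y}_m(n) \tag{7.1.1}$$

$$\hat{y}_m(n) \triangleq \mathbf{c}_m^H(n)\,\mathbf{x}_m(n) \tag{7.1.2}$$

$$\mathbf{c}_m(n) \triangleq [c_1^{(m)}(n) \;\; c_2^{(m)}(n) \;\; \cdots \;\; c_m^{(m)}(n)]^T \tag{7.1.3}$$

$$\mathbf{x}_m(n) = [x_1(n) \;\; x_2(n) \;\; \cdots \;\; x_m(n)]^T \tag{7.1.4}$$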


In general, we use the subscript m to denote the order of a matrix or vector and the superscript
m to emphasize that a scalar is a component of an m x 1 vector. We note that these quantities
are functions of time n, but sometimes we do not explicitly show this dependence for the
sake of simplicity.

If the mth-order estimator $\mathbf{c}_m(n)$ has been computed by solving the normal equations, it
seems to be a waste of computational power to start from scratch to compute the (m + 1)st-order
estimator $\mathbf{c}_{m+1}(n)$. Thus, we would like to arrange the computations so that the results
for order m, that is, $\mathbf{c}_m(n)$ or $\hat{y}_m(n)$, can be used to compute the estimates for order m + 1,
that is, $\mathbf{c}_{m+1}(n)$ or $\hat{y}_{m+1}(n)$. The resulting procedures are called order-recursive algorithms
or order-updating relations. Similarly, procedures that compute $\mathbf{c}_m(n+1)$ from $\mathbf{c}_m(n)$ or
$\hat{y}_m(n+1)$ from $\hat{y}_m(n)$ are called time-recursive algorithms or time-updating relations.
Combined order and time updates are also possible. All these updates play a central role in
the design and implementation of many optimum and adaptive filters.

In this section, we derive order-recursive algorithms for the computation of the $LDL^H$
decomposition, the MMSE, and the MMSE optimal estimate. We also show that there is no
order-recursive algorithm for the computation of the estimator parameters.


7.1.1 Matrix Partitioning and Optimum Nesting 


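We start by introducing some notation that is useful for the discussion of order-recursive
algorithms.† Notice that if the order of the estimator increases from m to m + 1, then the
input data vector is augmented with one additional observation $x_{m+1}$. We use the notation
$\mathbf{x}_{m+1}^{\lceil m \rceil}$ to denote the vector that consists of the first m components and
$\mathbf{x}_{m+1}^{\lfloor m \rfloor}$ for the last m components of vector $\mathbf{x}_{m+1}$. The same notation can be
generalized to matrices. The m x m matrix $\mathbf{R}_{m+1}^{\lceil m \rceil}$, obtained by the intersection of the
first m rows and columns of $\mathbf{R}_{m+1}$, is known as the mth-order leading principal submatrix
of $\mathbf{R}_{m+1}$. In other words, if $r_{ij}$ are the elements of $\mathbf{R}_{m+1}$, then the elements of
$\mathbf{R}_{m+1}^{\lceil m \rceil}$ are $r_{ij}$, $1 \le i, j \le m$. Similarly, $\mathbf{R}_{m+1}^{\lfloor m \rfloor}$ denotes the matrix obtained
by the intersection of the last m rows and columns of $\mathbf{R}_{m+1}$.

†All quantities in Sections 7.1 and 7.2 are functions of the time index n. However, for notational
simplicity we do not explicitly show this dependence.

For example, if m = 3 we obtain

$$\mathbf{R}_4 = \begin{bmatrix} r_{11} & r_{12} & r_{13} & r_{14} \\ r_{21} & r_{22} & r_{23} & r_{24} \\ r_{31} & r_{32} & r_{33} & r_{34} \\ r_{41} & r_{42} & r_{43} & r_{44} \end{bmatrix} \tag{7.1.5}$$

in which $\mathbf{R}_4^{\lceil 3 \rceil}$ is the upper left 3 x 3 block (rows and columns 1 to 3) and
$\mathbf{R}_4^{\lfloor 3 \rfloor}$ is the lower right 3 x 3 block (rows and columns 2 to 4). This
illustrates the upper left corner and lower right corner partitionings of matrix $\mathbf{R}_4$.
Since $\mathbf{x}_{m+1}^{\lceil m \rceil} = \mathbf{x}_m$, we can easily see that the correlation matrix can be partitioned as

$$\mathbf{R}_{m+1} = E\left\{ \begin{bmatrix} \mathbf{x}_m \\ x_{m+1} \end{bmatrix} \begin{bmatrix} \mathbf{x}_m^H & x_{m+1}^* \end{bmatrix} \right\} = \begin{bmatrix} \mathbf{R}_m & \mathbf{r}_m^b \\ \mathbf{r}_m^{bH} & \rho_m^b \end{bmatrix} \tag{7.1.6}$$

where

$$\mathbf{r}_m^b \triangleq E\{\mathbf{x}_m x_{m+1}^*\} \tag{7.1.7}$$

and

$$\rho_m^b \triangleq E\{|x_{m+1}|^2\} \tag{7.1.8}$$

The result

$$\mathbf{x}_{m+1}^{\lceil m \rceil} = \mathbf{x}_m \;\Rightarrow\; \mathbf{R}_m = \mathbf{R}_{m+1}^{\lceil m \rceil} \tag{7.1.9}$$

is known as the optimum nesting property and is instrumental in the development of
order-recursive algorithms. Similarly, we can show that $\mathbf{x}_{m+1}^{\lceil m \rceil} = \mathbf{x}_m$ implies

$$\mathbf{d}_{m+1} = E\{\mathbf{x}_{m+1} y^*\} = E\left\{ \begin{bmatrix} \mathbf{x}_m \\ x_{m+1} \end{bmatrix} y^* \right\} = \begin{bmatrix} \mathbf{d}_m \\ d_{m+1} \end{bmatrix} \tag{7.1.10}$$

or

$$\mathbf{x}_{m+1}^{\lceil m \rceil} = \mathbf{x}_m \;\Rightarrow\; \mathbf{d}_m = \mathbf{d}_{m+1}^{\lceil m \rceil} \tag{7.1.11}$$

that is, the right-hand side of the normal equations also has the optimum nesting property.

Since (7.1.9) and (7.1.11) hold for all $1 \le m \le M$, the correlation matrix $\mathbf{R}_M$ and the
cross-correlation vector $\mathbf{d}_M$ contain the information for the computation of all the optimum
estimators $\mathbf{c}_m$ for $1 \le m \le M$.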


7.1.2 Inversion of Partitioned Hermitian Matrices 


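Suppose now that we know the inverse $\mathbf{R}_m^{-1}$ of the leading principal submatrix
$\mathbf{R}_{m+1}^{\lceil m \rceil} = \mathbf{R}_m$ of matrix $\mathbf{R}_{m+1}$ and we wish to use it to compute
$\mathbf{R}_{m+1}^{-1}$ without having to repeat all the work. Since the inverse $\mathbf{Q}_{m+1}$ of the
Hermitian matrix $\mathbf{R}_{m+1}$ is also Hermitian, it can be partitioned as

$$\mathbf{Q}_{m+1} = \begin{bmatrix} \mathbf{Q}_m & \mathbf{q}_m \\ \mathbf{q}_m^H & q_m \end{bmatrix} \tag{7.1.12}$$

Using (7.1.6), we obtain

$$\mathbf{R}_{m+1}\mathbf{Q}_{m+1} = \begin{bmatrix} \mathbf{R}_m & \mathbf{r}_m^b \\ \mathbf{r}_m^{bH} & \rho_m^b \end{bmatrix} \begin{bmatrix} \mathbf{Q}_m & \mathbf{q}_m \\ \mathbf{q}_m^H & q_m \end{bmatrix} = \begin{bmatrix} \mathbf{I}_m & \mathbf{0}_m \\ \mathbf{0}_m^H & 1 \end{bmatrix} \tag{7.1.13}$$

After performing the matrix multiplication, we get

$$\mathbf{R}_m \mathbf{Q}_m + \mathbf{r}_m^b \mathbf{q}_m^H = \mathbf{I}_m \tag{7.1.14}$$

$$\mathbf{r}_m^{bH} \mathbf{Q}_m + \rho_m^b \mathbf{q}_m^H = \mathbf{0}_m^H \tag{7.1.15}$$

$$\mathbf{R}_m \mathbf{q}_m + \mathbf{r}_m^b q_m = \mathbf{0}_m \tag{7.1.16}$$

$$\mathbf{r}_m^{bH} \mathbf{q}_m + \rho_m^b q_m = 1 \tag{7.1.17}$$

where $\mathbf{0}_m$ is the m x 1 zero vector. If matrix $\mathbf{R}_m$ is invertible, we can solve (7.1.16) for
the vector $\mathbf{q}_m$

$$\mathbf{q}_m = -\mathbf{R}_m^{-1} \mathbf{r}_m^b q_m \tag{7.1.18}$$

and then substitute into (7.1.17) to obtain the scalar $q_m$ as

$$q_m = \frac{1}{\rho_m^b - \mathbf{r}_m^{bH} \mathbf{R}_m^{-1} \mathbf{r}_m^b} \tag{7.1.19}$$

assuming that the scalar quantity $\rho_m^b - \mathbf{r}_m^{bH} \mathbf{R}_m^{-1} \mathbf{r}_m^b \ne 0$. Substituting (7.1.19) into (7.1.18),
we obtain

$$\mathbf{q}_m = \frac{-\mathbf{R}_m^{-1} \mathbf{r}_m^b}{\rho_m^b - \mathbf{r}_m^{bH} \mathbf{R}_m^{-1} \mathbf{r}_m^b} \tag{7.1.20}$$

which, in conjunction with (7.1.14), yields

$$\mathbf{Q}_m = \mathbf{R}_m^{-1} - \mathbf{R}_m^{-1} \mathbf{r}_m^b \mathbf{q}_m^H = \mathbf{R}_m^{-1} + \frac{\mathbf{R}_m^{-1} \mathbf{r}_m^b (\mathbf{R}_m^{-1} \mathbf{r}_m^b)^H}{\rho_m^b - \mathbf{r}_m^{bH} \mathbf{R}_m^{-1} \mathbf{r}_m^b} \tag{7.1.21}$$

We note that (7.1.19) through (7.1.21) express the parts of the inverse matrix $\mathbf{Q}_{m+1}$ in terms
of known quantities. For our purposes, we express the above equations in a more convenient
form, using the quantities

$$\mathbf{b}_m \triangleq [b_0^{(m)} \;\; b_1^{(m)} \;\; \cdots \;\; b_{m-1}^{(m)}]^T \triangleq -\mathbf{R}_m^{-1} \mathbf{r}_m^b \tag{7.1.22}$$

and

$$\alpha_m^b \triangleq \rho_m^b - \mathbf{r}_m^{bH} \mathbf{R}_m^{-1} \mathbf{r}_m^b = \rho_m^b + \mathbf{r}_m^{bH} \mathbf{b}_m \tag{7.1.23}$$

Thus, if matrix $\mathbf{R}_m$ is invertible and $\alpha_m^b \ne 0$, combining (7.1.13) with (7.1.19) through
(7.1.23), we obtain

$$\mathbf{R}_{m+1}^{-1} = \begin{bmatrix} \mathbf{R}_m & \mathbf{r}_m^b \\ \mathbf{r}_m^{bH} & \rho_m^b \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{R}_m^{-1} & \mathbf{0}_m \\ \mathbf{0}_m^H & 0 \end{bmatrix} + \frac{1}{\alpha_m^b} \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} \begin{bmatrix} \mathbf{b}_m^H & 1 \end{bmatrix} \tag{7.1.24}$$

which determines $\mathbf{R}_{m+1}^{-1}$ from $\mathbf{R}_m^{-1}$ by using a simple rank-one modification known as the
matrix inversion by partitioning lemma (Noble and Daniel 1988).

Another useful expression for $\alpha_m^b$ is

$$\alpha_m^b = \frac{\det \mathbf{R}_{m+1}}{\det \mathbf{R}_m} \tag{7.1.25}$$

which reinforces the importance of the quantity $\alpha_m^b$ for the invertibility of matrix $\mathbf{R}_{m+1}$ (see
Problem 7.1).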


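EXAMPLE 7.1.1. Given the matrix

$$\mathbf{R}_3 = \begin{bmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{3} \\ \tfrac{1}{2} & 1 & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{1}{2} & 1 \end{bmatrix}$$

and the inverse matrix

$$\mathbf{R}_2^{-1} = \begin{bmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 \end{bmatrix}^{-1} = \frac{1}{3} \begin{bmatrix} 4 & -2 \\ -2 & 4 \end{bmatrix}$$

compute matrix $\mathbf{R}_3^{-1}$ using the matrix inversion by partitioning lemma.


Solution. To determine $\mathbf{R}_3^{-1}$ from the order-updating formula (7.1.24), we first compute

$$\mathbf{b}_2 = -\mathbf{R}_2^{-1} \mathbf{r}_2^b = -\frac{1}{3} \begin{bmatrix} 4 & -2 \\ -2 & 4 \end{bmatrix} \begin{bmatrix} \tfrac{1}{3} \\ \tfrac{1}{2} \end{bmatrix} = -\frac{1}{9} \begin{bmatrix} 1 \\ 4 \end{bmatrix}$$

and

$$\alpha_2^b = \rho_2^b + \mathbf{r}_2^{bT} \mathbf{b}_2 = 1 - \frac{1}{9} \begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{2} \end{bmatrix} \begin{bmatrix} 1 \\ 4 \end{bmatrix} = \frac{20}{27}$$

using (7.1.22) and (7.1.23). Then we compute

$$\mathbf{R}_3^{-1} = \frac{1}{3} \begin{bmatrix} 4 & -2 & 0 \\ -2 & 4 & 0 \\ 0 & 0 & 0 \end{bmatrix} + \frac{27}{20} \begin{bmatrix} -\tfrac{1}{9} \\ -\tfrac{4}{9} \\ 1 \end{bmatrix} \begin{bmatrix} -\tfrac{1}{9} & -\tfrac{4}{9} & 1 \end{bmatrix} = \frac{1}{20} \begin{bmatrix} 27 & -12 & -3 \\ -12 & 32 & -12 \\ -3 & -12 & 27 \end{bmatrix}$$

using (7.1.24). The reader can easily verify the above calculations using MATLAB.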
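
As a complement to the MATLAB check, the update (7.1.22) through (7.1.24) can be sketched in a few lines of Python. This is a minimal illustration and not code from the text: the function name `inverse_by_partitioning` is hypothetical, the data are assumed real-valued (so the Hermitian transpose reduces to the ordinary transpose), and exact rational arithmetic from the standard `fractions` module is used so that the results can be compared directly with the fractions of Example 7.1.1.

```python
from fractions import Fraction as F

def inverse_by_partitioning(R_inv, r_b, rho_b):
    """Grow R_m^{-1} into R_{m+1}^{-1} with (7.1.22)-(7.1.24), real-valued case."""
    m = len(R_inv)
    # b_m = -R_m^{-1} r_m^b, equation (7.1.22)
    b = [-sum(R_inv[i][j] * r_b[j] for j in range(m)) for i in range(m)]
    # alpha_m^b = rho_m^b + r_m^{bT} b_m, equation (7.1.23)
    alpha = rho_b + sum(r * v for r, v in zip(r_b, b))
    if alpha == 0:
        raise ZeroDivisionError("alpha_m^b = 0, so R_{m+1} is not invertible")
    u = b + [1]  # the vector [b_m; 1]
    # Zero-padded R_m^{-1} plus the rank-one term u u^T / alpha, equation (7.1.24)
    padded = [list(row) + [0] for row in R_inv] + [[0] * (m + 1)]
    Q = [[padded[i][j] + u[i] * u[j] / alpha for j in range(m + 1)]
         for i in range(m + 1)]
    return Q, b, alpha

# Data of Example 7.1.1: R_2^{-1}, r_2^b = [1/3, 1/2]^T and rho_2^b = 1
R2_inv = [[F(4, 3), F(-2, 3)], [F(-2, 3), F(4, 3)]]
r2_b = [F(1, 3), F(1, 2)]
R3_inv, b2, alpha2 = inverse_by_partitioning(R2_inv, r2_b, F(1))

print("b_2 =", [str(v) for v in b2], " alpha_2^b =", alpha2)
for row in R3_inv:
    print([str(20 * v) for v in row])  # rows of 20 * R_3^{-1}
```

Each call costs on the order of m^2 operations, compared with m^3 for inverting the bordered matrix from scratch, which is the practical point of the lemma.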


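Following a similar approach, we can show (see Problem 7.2) that the inverse of the
lower right corner partitioned matrix $\mathbf{R}_{m+1}$ can be expressed as

$$\mathbf{R}_{m+1}^{-1} = \begin{bmatrix} \rho_m^f & \mathbf{r}_m^{fH} \\ \mathbf{r}_m^f & \mathbf{R}_m^f \end{bmatrix}^{-1} = \begin{bmatrix} 0 & \mathbf{0}_m^H \\ \mathbf{0}_m & (\mathbf{R}_m^f)^{-1} \end{bmatrix} + \frac{1}{\alpha_m^f} \begin{bmatrix} 1 \\ \mathbf{a}_m \end{bmatrix} \begin{bmatrix} 1 & \mathbf{a}_m^H \end{bmatrix} \tag{7.1.26}$$

where

$$\mathbf{a}_m \triangleq [a_1^{(m)} \;\; a_2^{(m)} \;\; \cdots \;\; a_m^{(m)}]^T \triangleq -(\mathbf{R}_m^f)^{-1} \mathbf{r}_m^f \tag{7.1.27}$$

$$\alpha_m^f \triangleq \rho_m^f - \mathbf{r}_m^{fH} (\mathbf{R}_m^f)^{-1} \mathbf{r}_m^f = \rho_m^f + \mathbf{r}_m^{fH} \mathbf{a}_m = \frac{\det \mathbf{R}_{m+1}}{\det \mathbf{R}_m^f} \tag{7.1.28}$$

and the relationship (7.1.26) exists if matrix $\mathbf{R}_m^f$ is invertible and $\alpha_m^f \ne 0$. A similar set of
formulas can be obtained for arbitrary matrices (see Problem 7.3).


Interpretations. The vector $\mathbf{b}_m$, defined by (7.1.22), is the MMSE estimator of observation
$x_{m+1}$ from data vector $\mathbf{x}_m$. Indeed, if

$$e_m^b = x_{m+1} - \hat{x}_{m+1} = x_{m+1} + \mathbf{b}_m^H \mathbf{x}_m \tag{7.1.29}$$

we can show, using the orthogonality principle $E\{\mathbf{x}_m e_m^{b*}\} = \mathbf{0}$, that $\mathbf{b}_m$ results in the MMSE
given by

$$P_m^b = \rho_m^b + \mathbf{b}_m^H \mathbf{r}_m^b = \alpha_m^b \tag{7.1.30}$$

Similarly, we can show that $\mathbf{a}_m$, defined by (7.1.27), is the optimum estimator of $x_1$ based on
$\tilde{\mathbf{x}}_m \triangleq [x_2 \;\; x_3 \;\; \cdots \;\; x_{m+1}]^T$. By using the orthogonality principle, $E\{\tilde{\mathbf{x}}_m e_m^{f*}\} = \mathbf{0}$, the MMSE
is

$$P_m^f = \rho_m^f + \mathbf{r}_m^{fH} \mathbf{a}_m = \alpha_m^f \tag{7.1.31}$$

If $\mathbf{x}_{m+1} = [x(n) \;\; x(n-1) \;\; \cdots \;\; x(n-m)]^T$, then $\mathbf{b}_m$ provides the backward linear predictor
(BLP) and $\mathbf{a}_m$ the forward linear predictor (FLP) of the process x(n) from Section 6.5. For
convenience, we always use this terminology even if, strictly speaking, the linear prediction
interpretation is not applicable.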


7.1.3 Levinson Recursion for the Optimum Estimator 


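We now illustrate how to use (7.1.24) to express the optimum estimator $\mathbf{c}_{m+1}$ in terms of
the estimator $\mathbf{c}_m$. Indeed, using (7.1.24), (7.1.10), and the normal equations $\mathbf{R}_m \mathbf{c}_m = \mathbf{d}_m$,
we have

$$\mathbf{c}_{m+1} = \mathbf{R}_{m+1}^{-1} \mathbf{d}_{m+1} = \begin{bmatrix} \mathbf{R}_m^{-1} & \mathbf{0}_m \\ \mathbf{0}_m^H & 0 \end{bmatrix} \begin{bmatrix} \mathbf{d}_m \\ d_{m+1} \end{bmatrix} + \frac{1}{\alpha_m^b} \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} \begin{bmatrix} \mathbf{b}_m^H & 1 \end{bmatrix} \begin{bmatrix} \mathbf{d}_m \\ d_{m+1} \end{bmatrix} = \begin{bmatrix} \mathbf{R}_m^{-1} \mathbf{d}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} \frac{\mathbf{b}_m^H \mathbf{d}_m + d_{m+1}}{\alpha_m^b}$$

or more concisely

$$\mathbf{c}_{m+1} = \begin{bmatrix} \mathbf{c}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} k_m^c \tag{7.1.32}$$

where the quantities

$$k_m^c \triangleq \frac{\beta_m^c}{\alpha_m^b} \tag{7.1.33}$$

and

$$\beta_m^c \triangleq \mathbf{b}_m^H \mathbf{d}_m + d_{m+1} \tag{7.1.34}$$

contain the “new information” $d_{m+1}$ (the new component of $\mathbf{d}_{m+1}$). By using (7.1.22) and
$\mathbf{R}_m \mathbf{c}_m = \mathbf{d}_m$, alternatively $\beta_m^c$ can be written as

$$\beta_m^c = -\mathbf{r}_m^{bH} \mathbf{c}_m + d_{m+1} \tag{7.1.35}$$

We will use the term Levinson recursion for the order-updating relation (7.1.32) because a
similar recursion was introduced as part of the celebrated algorithm due to Levinson (see
Section 7.3). However, we stress that even though (7.1.32) is order-recursive, the parameter
vector $\mathbf{c}_{m+1}$ does not have the optimum nesting property, that is, $\mathbf{c}_{m+1}^{\lceil m \rceil} \ne \mathbf{c}_m$.

Clearly, if we know the vector $\mathbf{b}_m$, we can determine $\mathbf{c}_{m+1}$ using (7.1.32); however,
its practical utility depends on how easily we can obtain the vector $\mathbf{b}_m$. In general, $\mathbf{b}_m$
requires the solution of an m x m linear system of equations, and the computational savings
compared to direct solution of the (m + 1)st-order normal equations is insignificant. For
the Levinson recursion to be useful, we need an order recursion for vector $\mathbf{b}_m$. Since matrix
$\mathbf{R}_m$ has the optimum nesting property, we need to check whether the same is true for
the right-hand side vector in $\mathbf{R}_{m+1} \mathbf{b}_{m+1} = -\mathbf{r}_{m+1}^b$. From the definition
$\mathbf{r}_m^b \triangleq E\{\mathbf{x}_m x_{m+1}^*\}$, we can easily see that $\mathbf{r}_{m+1}^{b \lceil m \rceil} \ne \mathbf{r}_m^b$ and
$\mathbf{r}_{m+1}^{b \lfloor m \rfloor} \ne \mathbf{r}_m^b$. Hence, in general, we cannot find
a Levinson recursion for vector $\mathbf{b}_m$. This is possible only in optimum filtering problems
in which the input data vector $\mathbf{x}_m(n)$ has a shift-invariance structure (see Section
7.3).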


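EXAMPLE 7.1.2. Use the Levinson recursion to determine the optimum linear estimator $\mathbf{c}_3$
specified by the matrix

$$\mathbf{R}_3 = \begin{bmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{3} \\ \tfrac{1}{2} & 1 & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{1}{2} & 1 \end{bmatrix}$$

in Example 7.1.1 and the cross-correlation vector

$$\mathbf{d}_3 = [1 \;\; 2 \;\; 4]^T$$


Solution. For m = 1 we have $r_{11} c_1^{(1)} = d_1$, which gives $c_1^{(1)} = 1$. Also, from (7.1.32) and
(7.1.34) we obtain $k_0^c = c_1^{(1)} = 1$ and $\beta_0^c = d_1 = 1$. Finally, from $k_0^c = \beta_0^c / \alpha_0^b$, we get $\alpha_0^b = 1$.

To obtain $\mathbf{c}_2$, we need $\mathbf{b}_1$, $\beta_1^c$, $\alpha_1^b$, and $k_1^c$. We have

$$b_0^{(1)} = -\frac{r_1^b}{r_{11}} = -\frac{1/2}{1} = -\frac{1}{2}$$

$$\beta_1^c = b_0^{(1)} d_1 + d_2 = -\tfrac{1}{2}(1) + 2 = \tfrac{3}{2}$$

$$\alpha_1^b = \rho_1^b + r_1^b b_0^{(1)} = 1 + \tfrac{1}{2}\left(-\tfrac{1}{2}\right) = \tfrac{3}{4}$$

$$k_1^c = \frac{\beta_1^c}{\alpha_1^b} = 2$$

and therefore

$$\mathbf{c}_2 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} + \begin{bmatrix} -\tfrac{1}{2} \\ 1 \end{bmatrix} 2 = \begin{bmatrix} 0 \\ 2 \end{bmatrix}$$

To determine $\mathbf{c}_3$, we need $\mathbf{b}_2$, $\beta_2^c$, and $\alpha_2^b$. To obtain $\mathbf{b}_2$, we solve the linear system

$$\mathbf{R}_2 \mathbf{b}_2 = -\mathbf{r}_2^b \quad \text{or} \quad \begin{bmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 \end{bmatrix} \begin{bmatrix} b_0^{(2)} \\ b_1^{(2)} \end{bmatrix} = -\begin{bmatrix} \tfrac{1}{3} \\ \tfrac{1}{2} \end{bmatrix} \;\Rightarrow\; \mathbf{b}_2 = -\frac{1}{9} \begin{bmatrix} 1 \\ 4 \end{bmatrix}$$

and then compute

$$\beta_2^c = \mathbf{b}_2^T \mathbf{d}_2 + d_3 = -\frac{1}{9} \begin{bmatrix} 1 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} + 4 = 3$$

$$\alpha_2^b = \rho_2^b + \mathbf{r}_2^{bT} \mathbf{b}_2 = 1 + \begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{2} \end{bmatrix} \left(-\frac{1}{9}\right) \begin{bmatrix} 1 \\ 4 \end{bmatrix} = \frac{20}{27}$$

$$k_2^c = \frac{\beta_2^c}{\alpha_2^b} = \frac{81}{20}$$

$$\mathbf{c}_3 = \begin{bmatrix} \mathbf{c}_2 \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_2 \\ 1 \end{bmatrix} k_2^c = \begin{bmatrix} 0 \\ 2 \\ 0 \end{bmatrix} + \frac{81}{20} \begin{bmatrix} -\tfrac{1}{9} \\ -\tfrac{4}{9} \\ 1 \end{bmatrix} = \frac{1}{20} \begin{bmatrix} -9 \\ 4 \\ 81 \end{bmatrix}$$

which agrees with the solution obtained by solving $\mathbf{R}_3 \mathbf{c}_3 = \mathbf{d}_3$ using the function c3=R3\d3.
We can also solve this linear system by developing an algorithm using the lower partitioning
(7.1.26) as discussed in Problem 7.4.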
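
The hand calculation above can be mirrored by a short Python sketch. It is only an illustration under stated assumptions: the name `levinson_general` is hypothetical, the data are real-valued, and the vector $\mathbf{b}_m$ is obtained from an inverse that is carried along with (7.1.24) rather than from a separate linear solve. Exact rationals (`fractions`) keep the output comparable with the fractions of Example 7.1.2.

```python
from fractions import Fraction as F

def levinson_general(R, d):
    """Return [c_1, ..., c_M] for a real symmetric positive definite R via (7.1.32)."""
    M = len(R)
    R_inv = [[1 / R[0][0]]]           # order-1 inverse is a scalar division
    c = [d[0] / R[0][0]]
    estimators = [list(c)]
    for m in range(1, M):             # step from order m to order m + 1
        r_b = [R[i][m] for i in range(m)]   # r_m^b: new column above the diagonal
        rho_b = R[m][m]
        b = [-sum(R_inv[i][j] * r_b[j] for j in range(m)) for i in range(m)]  # (7.1.22)
        alpha = rho_b + sum(r * v for r, v in zip(r_b, b))                    # (7.1.23)
        beta = sum(v * x for v, x in zip(b, d[:m])) + d[m]                    # (7.1.34)
        k = beta / alpha                                                      # (7.1.33)
        c = [ci + bi * k for ci, bi in zip(c, b)] + [k]                       # (7.1.32)
        estimators.append(list(c))
        # Update the inverse with the rank-one correction of (7.1.24)
        u = b + [1]
        padded = [row + [0] for row in R_inv] + [[0] * (m + 1)]
        R_inv = [[padded[i][j] + u[i] * u[j] / alpha for j in range(m + 1)]
                 for i in range(m + 1)]
    return estimators

# Data of Example 7.1.2
R3 = [[F(1), F(1, 2), F(1, 3)],
      [F(1, 2), F(1), F(1, 2)],
      [F(1, 3), F(1, 2), F(1)]]
d3 = [F(1), F(2), F(4)]
c1, c2, c3 = levinson_general(R3, d3)
print([str(v) for v in c3])
```

Note how the first two entries of `c3` differ from `c2`: the estimator is updated in every component at each order, which is the lack of optimum nesting stressed in the text.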


Matrix inversion and the linear system solution for m = 1 are trivial (scalar division
only). If $\mathbf{R}_M$ is strictly positive definite, that is, $\mathbf{R}_m = \mathbf{R}_M^{\lceil m \rceil}$ is positive definite for all
$1 \le m \le M$, the inverse matrices $\mathbf{R}_m^{-1}$ and the solutions of $\mathbf{R}_m \mathbf{c}_m = \mathbf{d}_m$, $2 \le m \le M$, can
be determined using (7.1.22) and the Levinson recursion (7.1.32) for m = 1, 2, ..., M - 1.
However, in practice using the $LDL^H$ decomposition provides a better method for performing these
computations.


7.1.4 Order-Recursive Computation of the $LDL^H$ Decomposition


We start by showing that the $LDL^H$ decomposition can be computed in an order-recursive
manner. The procedure is developed as part of a formal proof of the $LDL^H$ decomposition
using induction.

For M = 1, the matrix $\mathbf{R}_1$ is a positive number $r_{11}$ and can be written uniquely in
the form $r_{11} = 1 \cdot \xi_1 \cdot 1 > 0$. As we increment the order m, the (m + 1)st-order principal
submatrix of $\mathbf{R}_M$ can be partitioned as in (7.1.6). By the induction hypothesis, there are
unique matrices $\mathbf{L}_m$ and $\mathbf{D}_m$ such that

$$\mathbf{R}_m = \mathbf{L}_m \mathbf{D}_m \mathbf{L}_m^H \tag{7.1.36}$$

We next form the matrices

$$\mathbf{L}_{m+1} = \begin{bmatrix} \mathbf{L}_m & \mathbf{0} \\ \mathbf{l}_m^H & 1 \end{bmatrix} \qquad \mathbf{D}_{m+1} = \begin{bmatrix} \mathbf{D}_m & \mathbf{0} \\ \mathbf{0}^H & \xi_{m+1} \end{bmatrix} \tag{7.1.37}$$

and try to determine the vector $\mathbf{l}_m$ and the positive number $\xi_{m+1}$ so that

$$\mathbf{R}_{m+1} = \mathbf{L}_{m+1} \mathbf{D}_{m+1} \mathbf{L}_{m+1}^H \tag{7.1.38}$$

Using (7.1.6) and (7.1.36) through (7.1.38), we see that

$$(\mathbf{L}_m \mathbf{D}_m)\, \mathbf{l}_m = \mathbf{r}_m^b \tag{7.1.39}$$

$$\rho_m^b = \mathbf{l}_m^H \mathbf{D}_m \mathbf{l}_m + \xi_{m+1}, \qquad \xi_{m+1} > 0 \tag{7.1.40}$$

Since

$$\det \mathbf{R}_m = \det \mathbf{L}_m \det \mathbf{D}_m \det \mathbf{L}_m^H = \xi_1 \xi_2 \cdots \xi_m > 0 \tag{7.1.41}$$

then $\det \mathbf{L}_m \mathbf{D}_m \ne 0$ and (7.1.39) has a unique solution $\mathbf{l}_m$. Finally, from (7.1.41) we obtain
$\xi_{m+1} = \det \mathbf{R}_{m+1} / \det \mathbf{R}_m$, and therefore $\xi_{m+1} > 0$ because $\mathbf{R}_{m+1}$ is positive definite.
Hence, $\xi_{m+1}$ is uniquely computed from (7.1.41), which completes the proof.

Because the triangular matrix $\mathbf{L}_m$ is generated row by row using (7.1.39) and because
the diagonal elements of matrix $\mathbf{D}_m$ are computed sequentially using (7.1.40), both matrices
have the optimum nesting property, that is, $\mathbf{L}_m = \mathbf{L}_M^{\lceil m \rceil}$, $\mathbf{D}_m = \mathbf{D}_M^{\lceil m \rceil}$. The optimum filter
$\mathbf{c}_m$ is then computed by solving

$$\mathbf{L}_m \mathbf{D}_m \mathbf{k}_m = \mathbf{d}_m \tag{7.1.42}$$

$$\mathbf{L}_m^H \mathbf{c}_m = \mathbf{k}_m \tag{7.1.43}$$

Using (7.1.42), we can easily see that $\mathbf{k}_m$ has the optimum nesting property, that is,
$\mathbf{k}_m = \mathbf{k}_M^{\lceil m \rceil}$ for $1 \le m \le M$. This is a consequence of the lower triangular form of $\mathbf{L}_m$. The
computation of $\mathbf{L}_m$, $\mathbf{D}_m$, and $\mathbf{k}_m$ can be done in a simple, order-recursive manner, which
is all that is needed to compute $\mathbf{c}_m$ for $1 \le m \le M$. However, the optimum estimator
does not have the optimum nesting property, that is, $\mathbf{c}_{m+1}^{\lceil m \rceil} \ne \mathbf{c}_m$, because of the backward
substitution involved in the solution of the upper triangular system (7.1.43) (see Example
6.3.1).

Using (7.1.42) and (7.1.43), we can write the MMSE for the mth-order linear estimator
as

$$P_m = P_y - \mathbf{c}_m^H \mathbf{d}_m = P_y - \mathbf{k}_m^H \mathbf{D}_m \mathbf{k}_m \tag{7.1.44}$$

which, owing to the optimum nesting property of $\mathbf{D}_m$ and $\mathbf{k}_m$, leads to

$$P_m = P_{m-1} - \xi_m |k_m|^2 \tag{7.1.45}$$

which is initialized with $P_0 = P_y$. Equation (7.1.45) provides an order-recursive algorithm
for the computation of the MMSE.
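
The inductive construction above translates almost line by line into code. The following Python sketch is an illustration only: the function name `ldl_order_recursive` is hypothetical, the data are assumed real-valued (so $LDL^H$ becomes $LDL^T$), and the value $P_y = 20$ in the usage example is an arbitrary assumption chosen for the demonstration, not a number from the text. The matrix and cross-correlation vector are those of Example 7.1.2.

```python
from fractions import Fraction as F

def ldl_order_recursive(R, d, P_y):
    """Order-recursive LDL^T factors, k_m and MMSE for real symmetric positive definite R."""
    M = len(R)
    L = [[1]]                      # L_1 = 1
    xi = [R[0][0]]                 # xi_1 = r_11
    k = [d[0] / xi[0]]             # first row of (7.1.42)
    P = [P_y, P_y - xi[0] * k[0] ** 2]          # P_0 and P_1 from (7.1.45)
    for m in range(1, M):
        r_b = [R[m][i] for i in range(m)]
        # Solve (L_m D_m) l_m = r_m^b, (7.1.39): forward substitution for g = D_m l_m
        g = []
        for i in range(m):
            g.append(r_b[i] - sum(L[i][j] * g[j] for j in range(i)))
        l = [g[i] / xi[i] for i in range(m)]
        # xi_{m+1} = rho_m^b - l_m^T D_m l_m, (7.1.40)
        xi_new = R[m][m] - sum(l[i] ** 2 * xi[i] for i in range(m))
        if xi_new <= 0:
            raise ValueError("R is not positive definite")
        # New component of k from the last row of (7.1.42)
        k_new = (d[m] - sum(l[i] * xi[i] * k[i] for i in range(m))) / xi_new
        # Border L as in (7.1.37); the earlier rows are left untouched (nesting)
        L = [row + [0] for row in L] + [l + [1]]
        xi.append(xi_new)
        k.append(k_new)
        P.append(P[-1] - xi_new * k_new ** 2)   # (7.1.45)
    return L, xi, k, P

R3 = [[F(1), F(1, 2), F(1, 3)],
      [F(1, 2), F(1), F(1, 2)],
      [F(1, 3), F(1, 2), F(1)]]
d3 = [F(1), F(2), F(4)]
L3, xi3, k3, P3 = ldl_order_recursive(R3, d3, F(20))  # P_y = 20 is a hypothetical value
print([str(v) for v in xi3], [str(v) for v in k3], [str(v) for v in P3])
```

Because each pass only appends a row to `L` and one entry to `xi`, `k`, and `P`, stopping the loop early yields exactly the lower-order quantities.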


7.1.5 Order-Recursive Computation of the Optimum Estimate 


The computation of the optimum linear estimate $\hat{y}_m = \mathbf{c}_m^H \mathbf{x}_m$, using a linear combiner,
requires m multiplications and m - 1 additions. Therefore, if we want to compute $\hat{y}_m$ for
$1 \le m \le M$, we need M linear combiners and hence M(M + 1)/2 operations.

We next provide an alternative, more efficient order-recursive implementation that
exploits the triangular decomposition of $\mathbf{R}_{m+1}$. We first notice that using (7.1.43), we
obtain

$$\hat{y}_m = \mathbf{c}_m^H \mathbf{x}_m = (\mathbf{k}_m^H \mathbf{L}_m^{-1}) \mathbf{x}_m = \mathbf{k}_m^H (\mathbf{L}_m^{-1} \mathbf{x}_m) \tag{7.1.46}$$

Next, we define vector $\mathbf{w}_m$ as

$$\mathbf{L}_m \mathbf{w}_m \triangleq \mathbf{x}_m \tag{7.1.47}$$

which can be found by using forward substitution in order to solve the triangular system.
Therefore, we obtain

$$\hat{y}_m = \mathbf{k}_m^H \mathbf{w}_m = \sum_{i=1}^{m} k_i^* w_i \tag{7.1.48}$$

which provides the estimate $\hat{y}_m$ in terms of $\mathbf{k}_m$ and $\mathbf{w}_m$, that is, without using the estimator
vector $\mathbf{c}_m$. Hence, if the ultimate goal is the computation of $\hat{y}_m$ we do not need to compute
the estimator $\mathbf{c}_m$.

For an order-recursive algorithm to be possible, the vector $\mathbf{w}_m$ must have the optimum
nesting property, that is, $\mathbf{w}_m = \mathbf{w}_M^{\lceil m \rceil}$. Indeed, using (7.1.37) and the matrix inversion by
partitioning lemma for nonsymmetric matrices (see Problem 7.3), we obtain

$$\mathbf{L}_{m+1}^{-1} = \begin{bmatrix} \mathbf{L}_m & \mathbf{0} \\ \mathbf{l}_m^H & 1 \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{L}_m^{-1} & \mathbf{0} \\ \mathbf{v}_m^H & 1 \end{bmatrix}$$

where

$$\mathbf{v}_m = -\mathbf{L}_m^{-H} \mathbf{l}_m = -(\mathbf{L}_m^H)^{-1} \mathbf{D}_m^{-1} \mathbf{L}_m^{-1} \mathbf{r}_m^b = -\mathbf{R}_m^{-1} \mathbf{r}_m^b = \mathbf{b}_m$$

due to (7.1.22). Therefore,

$$\mathbf{w}_{m+1} = \mathbf{L}_{m+1}^{-1} \mathbf{x}_{m+1} = \begin{bmatrix} \mathbf{L}_m^{-1} & \mathbf{0} \\ \mathbf{b}_m^H & 1 \end{bmatrix} \begin{bmatrix} \mathbf{x}_m \\ x_{m+1} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_m \\ w_{m+1} \end{bmatrix} \tag{7.1.49}$$

where

$$w_{m+1} = \mathbf{b}_m^H \mathbf{x}_m + x_{m+1} = e_m^b \tag{7.1.50}$$

from (7.1.29). In this case, we can derive order-recursive algorithms for the computation
of $\hat{y}_m$ and $e_m$, for all $1 \le m \le M$. Indeed, using (7.1.48) and (7.1.49), we obtain

$$\hat{y}_m = \hat{y}_{m-1} + k_m^* w_m \tag{7.1.51}$$

with $\hat{y}_0 = 0$. From (7.1.51) and $e_m = y - \hat{y}_m$, we have

$$e_m = e_{m-1} - k_m^* w_m \tag{7.1.52}$$

for m = 1, 2, ..., M with $e_0 = y$. The quantity $w_m$ can be computed in an order-recursive
manner by solving (7.1.47) using forward substitution. Indeed, from the mth row of (7.1.47)
we obtain

$$w_m = x_m - \sum_{i=1}^{m-1} l_{i-1}^{(m-1)} w_i \tag{7.1.53}$$

which provides a recursive computation of $w_m$ for m = 1, 2, ..., M. To comply with the
order-oriented notation, we use $l_{i-1}^{(m-1)}$ instead of $l_{m-1,i-1}$. Depending on the application,
we use either (7.1.51) or (7.1.52).

For MMSE estimation, all the quantities are functions of the time index n, and therefore,
the triangular decomposition of $\mathbf{R}_m$ and the recursions (7.1.51) through (7.1.53) should be
repeated for every new set of observations y(n) and x(n).
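
To make the recursions (7.1.51) and (7.1.53) concrete, here is a small Python sketch. It is an illustration under assumptions, not the book's implementation: the function name `order_recursive_estimates` is hypothetical, the data are real-valued, the factors `L3` and `k3` are those of the matrix in Example 7.1.2, and the observation vector `x3` is made up for the demonstration.

```python
from fractions import Fraction as F

def order_recursive_estimates(L, k, x):
    """Return the innovations w_1..w_M and the estimates y_hat_1..y_hat_M (real case)."""
    w, y_hat, acc = [], [], 0
    for m in range(len(x)):
        # Forward substitution on L w = x, one row per order, equation (7.1.53)
        w_m = x[m] - sum(L[m][i] * w[i] for i in range(m))
        w.append(w_m)
        # Add the contribution of the new innovation, equation (7.1.51)
        acc = acc + k[m] * w_m
        y_hat.append(acc)
    return w, y_hat

# LDL^T factor and k for R_3, d_3 of Example 7.1.2
L3 = [[F(1), F(0), F(0)],
      [F(1, 2), F(1), F(0)],
      [F(1, 3), F(4, 9), F(1)]]
k3 = [F(1), F(2), F(81, 20)]
x3 = [F(1), F(2), F(3)]            # hypothetical observations, not from the text

w3, y_hat3 = order_recursive_estimates(L3, k3, x3)

# Cross-check against the direct form y_hat_m = c_m^T x_m with c_1, c_2, c_3
# taken from Example 7.1.2
c = [[F(1)], [F(0), F(2)], [F(-9, 20), F(1, 5), F(81, 20)]]
direct = [sum(ci * xi for ci, xi in zip(cm, x3)) for cm in c]
print([str(v) for v in y_hat3], [str(v) for v in direct])
```

The orthogonal form needs one pass to produce all M estimates, whereas the direct form needs a separate coefficient vector for every order.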


EXAMPLE 7.1.3. A linear estimator is specified by the correlation matrix $\mathbf{R}_4$ and the cross-
correlation vector $\mathbf{d}_4$ in Example 6.3.2. Compute the estimates $\hat{y}_m$, $1 \le m \le 4$, if the input data
vector is given by $\mathbf{x}_4 = [1 \;\; 2 \;\; 1 \;\; -1]^T$.


Solution. Using the triangular factor Ly and the vector ky found in Example 6.3.2 and (7.1.53), 
we find 


and H=l f=t $3=66 8 fy= 146 


which the reader can verify by computing $\mathbf{c}_m$ and $\hat{y}_m = \mathbf{c}_m^H \mathbf{x}_m$, $1 \le m \le 4$.




If we compute the matrix

$$\mathbf{B}_{m+1} \triangleq \mathbf{L}_{m+1}^{-1} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ b_0^{(1)*} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ b_0^{(m)*} & b_1^{(m)*} & \cdots & 1 \end{bmatrix} \tag{7.1.54}$$

then (7.1.49) can be written as

$$\mathbf{w}_{m+1} = \mathbf{e}_{m+1}^b = \mathbf{B}_{m+1} \mathbf{x}_{m+1} \tag{7.1.55}$$

where

$$\mathbf{e}_{m+1}^b \triangleq [e_0^b \;\; e_1^b \;\; \cdots \;\; e_m^b]^T \tag{7.1.56}$$

is the BLP error vector. From (7.1.22), we can easily see that the rows of $\mathbf{B}_{m+1}$ are formed
by the optimum estimators $\mathbf{b}_m$ of $x_{m+1}$ from $\mathbf{x}_m$. Note that the elements of matrix $\mathbf{B}_{m+1}$ are
denoted by using the order-oriented notation $b_i^{(m)}$ introduced in Section 7.1 rather than the
conventional $b_{mi}$ matrix notation. Equation (7.1.55) provides an alternative computation
of $\mathbf{w}_{m+1}$ as a matrix-vector multiplication. Each component of $\mathbf{w}_{m+1}$ can be computed
independently, and hence in parallel, by the formula

$$w_j = x_j + \sum_{i=1}^{j-1} b_{i-1}^{(j-1)*} x_i \qquad 1 \le j \le m \tag{7.1.57}$$

which, in contrast to (7.1.53), is nonrecursive. Using (7.1.57) and (7.1.51), we can derive
the order-recursive MMSE estimator implementation shown in Figure 7.1.


[Signal flow diagram for inputs x1 to x4; stages: input, decorrelator, innovations, linear
combiner, output. Insets: basic processing element; second-order moments.]

FIGURE 7.1
Orthogonal order-recursive structure for linear MMSE estimation.


Finally, we notice that matrix $\mathbf{B}_m$ provides the $UDU^H$ decomposition of the inverse
correlation matrix $\mathbf{R}_m^{-1}$. Indeed, from (7.1.36) we obtain

$$\mathbf{R}_m^{-1} = (\mathbf{L}_m^H)^{-1} \mathbf{D}_m^{-1} \mathbf{L}_m^{-1} = \mathbf{B}_m^H \mathbf{D}_m^{-1} \mathbf{B}_m \tag{7.1.58}$$

because inversion and transposition are interchangeable and the $UDU^H$ decomposition is
unique. This formula provides a practical method to compute the inverse of the correlation
matrix by using the $LDL^H$ decomposition because computing the inverse of a triangular
matrix is simple (see Problem 7.5).
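
A brief Python sketch of (7.1.58) follows. It is a hedged illustration: the helper names `unit_lower_inverse` and `inverse_from_ldl` are hypothetical, the data are real-valued, and the factors used are those of the 3 x 3 matrix $\mathbf{R}_3$ of Example 7.1.1, so the result can be compared with the inverse found there.

```python
from fractions import Fraction as F

def unit_lower_inverse(L):
    """Invert a unit lower triangular matrix column by column (forward substitution)."""
    n = len(L)
    B = [[F(0)] * n for _ in range(n)]
    for j in range(n):
        B[j][j] = F(1)
        for i in range(j + 1, n):
            # Row i of L B = I, using L[i][i] = 1
            B[i][j] = -sum(L[i][p] * B[p][j] for p in range(j, i))
    return B

def inverse_from_ldl(L, xi):
    """Return B = L^{-1} and R^{-1} = B^T D^{-1} B, equation (7.1.58), real case."""
    n = len(L)
    B = unit_lower_inverse(L)
    R_inv = [[sum(B[p][i] * B[p][j] / xi[p] for p in range(n)) for j in range(n)]
             for i in range(n)]
    return B, R_inv

# LDL^T factors of R_3 from Example 7.1.1
L3 = [[F(1), F(0), F(0)],
      [F(1, 2), F(1), F(0)],
      [F(1, 3), F(4, 9), F(1)]]
xi3 = [F(1), F(3, 4), F(20, 27)]
B3, R3_inv = inverse_from_ldl(L3, xi3)
for row in R3_inv:
    print([str(20 * v) for v in row])   # rows of 20 * R_3^{-1}
```

The rows of `B3` below the first contain the backward predictor coefficients, which ties this computation to (7.1.54).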


7.2 INTERPRETATIONS OF ALGORITHMIC QUANTITIES 


We next show that various intermediate quantities that appear in the linear MMSE estimation 
algorithms have physical and statistical interpretations that, besides their intellectual value, 
facilitate better understanding of the operation, performance, and numerical properties of 
the algorithms. 


7.2.1 Innovations and Backward Prediction 


The correlation matrix of $\mathbf{w}_m$ is

$$E\{\mathbf{w}_m \mathbf{w}_m^H\} = \mathbf{L}_m^{-1} E\{\mathbf{x}_m \mathbf{x}_m^H\} \mathbf{L}_m^{-H} = \mathbf{D}_m \tag{7.2.1}$$

where we have used (7.1.47) and the triangular decomposition (7.1.36). Therefore, the
components of $\mathbf{w}_m$ are uncorrelated random variables with variances

$$\xi_i = E\{|w_i|^2\} \tag{7.2.2}$$

since $\xi_i > 0$. Furthermore, the two sets of random variables $\{w_1, w_2, \ldots, w_M\}$ and
$\{x_1, x_2, \ldots, x_M\}$ are linearly equivalent because they can be obtained from each other
through the linear transformation (7.1.47). This transformation removes all the redundant
correlation among the components of $\mathbf{x}$ and is known as a decorrelation or whitening operation
(see Section 3.5.2). Because the random variables $w_i$ are uncorrelated, each of them
adds “new information” or innovation. In this sense, $\{w_1, w_2, \ldots, w_m\}$ is the innovations
representation of the random variables $\{x_1, x_2, \ldots, x_m\}$. Because $\mathbf{x}_m = \mathbf{L}_m \mathbf{w}_m$, the random
vector $\mathbf{w}_m = \mathbf{e}_m^b$ is the innovations representation, and $\mathbf{x}_m$ and $\mathbf{w}_m$ are linearly equivalent
as well (see Section 3.5).

The cross-correlation matrix between $\mathbf{x}_m$ and $\mathbf{w}_m$ is

$$E\{\mathbf{x}_m \mathbf{w}_m^H\} = E\{\mathbf{L}_m \mathbf{w}_m \mathbf{w}_m^H\} = \mathbf{L}_m \mathbf{D}_m \tag{7.2.3}$$

which shows that, owing to the lower triangular form of $\mathbf{L}_m$, $E\{x_i w_j^*\} = 0$ for j > i.
We will see in Section 7.6 that these factors are related to the gapped functions and the
algorithm of Schur.

Furthermore, since $e_m^b = w_{m+1}$, from (7.1.50) we have

$$P_m^b = \xi_{m+1} = E\{|w_{m+1}|^2\}$$

which also can be shown algebraically by using (7.1.41), (7.1.40), and (7.1.30). Indeed, we
have

$$\xi_{m+1} = \frac{\det \mathbf{R}_{m+1}}{\det \mathbf{R}_m} = \rho_m^b - \mathbf{l}_m^H \mathbf{D}_m \mathbf{l}_m = \rho_m^b - \mathbf{r}_m^{bH} \mathbf{R}_m^{-1} \mathbf{r}_m^b = P_m^b \tag{7.2.4}$$

and, therefore,

$$\mathbf{D}_m = \mathrm{diag}\{P_0^b, P_1^b, \ldots, P_{m-1}^b\} \tag{7.2.5}$$
7.2.2 Partial Correlation


In general, the random variables $y, x_1, \ldots, x_m, x_{m+1}$ are correlated. The correlation between
y and $x_{m+1}$, after the influence from the components of the vector $\mathbf{x}_m$ has been
removed, is known as partial correlation. To remove the correlation due to $\mathbf{x}_m$, we extract
from y and $x_{m+1}$ the components that can be predicted from $\mathbf{x}_m$. The remaining correlation
is from the estimation errors $e_m$ and $e_m^b$, which are both uncorrelated with $\mathbf{x}_m$ because of
the orthogonality principle. Therefore, the partial correlation of y and $x_{m+1}$ is

$$\begin{aligned} \mathrm{PARCOR}(y; x_{m+1}) &\triangleq E\{e_m e_m^{b*}\} = E\{(y - \mathbf{c}_m^H \mathbf{x}_m) e_m^{b*}\} \\ &= E\{y e_m^{b*}\} = E\{y (x_{m+1}^* + \mathbf{x}_m^H \mathbf{b}_m)\} \\ &= E\{y x_{m+1}^*\} + E\{y \mathbf{x}_m^H\} \mathbf{b}_m \\ &= d_{m+1}^* + \mathbf{d}_m^H \mathbf{b}_m \triangleq \beta_m^{c*} \end{aligned} \tag{7.2.6}$$

where we have used the orthogonality principle $E\{\mathbf{x}_m e_m^{b*}\} = \mathbf{0}$ and (7.1.10), (7.1.50), and
(7.1.34).

The partial correlation PARCOR(y; $x_{m+1}$) is also related to the parameters $k_m$ obtained
from the $LDL^H$ decomposition. Indeed, from (7.1.42) and (7.1.54), we obtain the relation

$$\mathbf{k}_{m+1} = \mathbf{D}_{m+1}^{-1} \mathbf{B}_{m+1} \mathbf{d}_{m+1} \tag{7.2.7}$$

whose last row is

$$k_{m+1} = \frac{\mathbf{b}_m^H \mathbf{d}_m + d_{m+1}}{\xi_{m+1}} = \frac{\beta_m^c}{P_m^b} = k_m^c \tag{7.2.8}$$

owing to (7.2.4) and (7.2.6).


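EXAMPLE 7.2.1. The $LDL^H$ decomposition of matrix $\mathbf{R}_3$ in Example 7.1.2 is given by

$$\mathbf{L} = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & 1 & 0 \\ \tfrac{1}{3} & \tfrac{4}{9} & 1 \end{bmatrix} \qquad \mathbf{D} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \tfrac{3}{4} & 0 \\ 0 & 0 & \tfrac{20}{27} \end{bmatrix}$$

and can be found by using the function [L,D]=ldlt(R). Comparison with the results obtained
in Example 7.1.2 shows that the rows of the matrix

$$\mathbf{L}^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -\tfrac{1}{2} & 1 & 0 \\ -\tfrac{1}{9} & -\tfrac{4}{9} & 1 \end{bmatrix}$$

provide the elements of the backward predictors, whereas the diagonal elements of $\mathbf{D}$ are equal
to the scalars $\alpha_m^b$. Using (7.2.7), we obtain $\mathbf{k} = [1 \;\; 2 \;\; \tfrac{81}{20}]^T$, whose elements are the quantities
$k_0^c$, $k_1^c$, and $k_2^c$ computed in Example 7.1.2 using the Levinson recursion.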


7.2.3 Order Decomposition of the Optimum Estimate 

The equation $\hat{y}_{m+1} = \hat{y}_m + k_{m+1}^* w_{m+1}$ with $k_{m+1} = \beta_m^c / P_m^b = k_m^c$ shows that the
improvement in the estimate when we include one more observation $x_{m+1}$, that is, when we
increase the order by 1, is proportional to the innovation $w_{m+1}$ contained in $x_{m+1}$. The
innovation is the part of $x_{m+1}$ that cannot be linearly estimated from the already used data
$\mathbf{x}_m$. The term $w_{m+1}$ is scaled by the ratio of the partial correlation between y and the “new”
observation $x_{m+1}$ and the power of the innovation $P_m^b$.

Thus, the computation of the (m + 1)st-order estimate of y based on $\mathbf{x}_{m+1} = [\mathbf{x}_m^T \;\; x_{m+1}]^T$
can be reduced to two mth-order estimation problems: the estimation of y based on $\mathbf{x}_m$ and
the estimation of the new observation $x_{m+1}$ based on $\mathbf{x}_m$. This decomposition of linear
estimation problems into smaller ones has very important applications to the development
of efficient algorithms and structures for MMSE estimation.

We use the term direct for the implementation of the MMSE linear combiner as a sum
of products, involving the optimum parameters $c_i^{(m)}$, $1 \le i \le m$, to emphasize the direct
use of these coefficients. Because the random variables $w_i$ used in the implementation of
Figure 7.1 are orthogonal, that is, $\langle w_i, w_j \rangle = 0$ for $i \ne j$, we refer to this implementation
as the orthogonal implementation or the orthogonal structure. These two structures appear
in every type of linear MMSE estimation problem, and their particular form depends on the
specifics of the problem and the associated second-order moments. In this sense, they play
a prominent role in linear MMSE estimation in general, and in this book in particular.

We conclude our discussion with the following important observations: 


1. The direct implementation combines correlated, that is, redundant information, and it is 
not order-recursive because increasing the order of the estimator destroys the optimality 
of the existing coefficients. Again, the reason is that the direct-form optimum filter 
coefficients do not possess the optimal nesting property. 

2. The orthogonal implementation consists of a decorrelator and a linear combiner. The 
estimator combines the innovations of the data (nonredundant information) and is order- 
recursive because it does not use the optimum coefficient vector. Hence, increasing the 
order of the estimator preserves the optimality of the existing lower-order part. The 
resulting structure is modular such that each additional term improves the estimate by 
an amount proportional to the included innovation $w_m$.

3. Using the vector interpretation of random variables, the transformation $\tilde{\mathbf{x}}_m = \mathbf{F}_m \mathbf{x}_m$ is just
a change of basis. The choice $\mathbf{F}_m = \mathbf{L}_m^{-1}$ converts from the oblique set $\{x_1, x_2, \ldots, x_m\}$
to the orthogonal basis $\{w_1, w_2, \ldots, w_m\}$. The advantage of working with orthogonal
bases is that adding new components does not affect the optimality of previous ones. 

4. The $LDL^H$ decomposition for random vectors is the matrix equivalent of the spectral
factorization theorem for discrete-time, stationary, stochastic processes. Both approaches 
facilitate the design and implementation of optimum FIR and IIR filters (see Sections 
6.4 and 6.6). 


7.2.4 Gram-Schmidt Orthogonalization 


We next combine the geometric interpretation of the random variables with the Gram-Schmidt
procedure used in linear algebra. The Gram-Schmidt procedure produces the innovations
$\{w_1, w_2, \ldots, w_m\}$ by orthogonalizing the original set $\{x_1, x_2, \ldots, x_m\}$.

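We start by choosing $w_1$ to be in the direction of $x_1$, that is,

$$ w_1 = x_1 $$

The next "vector" $w_2$ should be orthogonal to $w_1$. To determine $w_2$, we subtract from $x_2$
its component along $w_1$ [see Figure 7.2(a)], that is,

$$ w_2 = x_2 - l_1^{(1)} w_1 $$

where $l_1^{(1)}$ is obtained from the condition $w_2 \perp w_1$ as follows:

$$ \langle w_2, w_1 \rangle = \langle x_2, w_1 \rangle - l_1^{(1)} \langle w_1, w_1 \rangle = 0 $$

or

$$ l_1^{(1)} = \frac{\langle x_2, w_1 \rangle}{\langle w_1, w_1 \rangle} $$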
FIGURE 7.2
Illustration of the Gram-Schmidt orthogonalization process. Panel (a): m = 2.


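Similarly, to determine $w_3$, we subtract from $x_3$ its components along $w_1$ and $w_2$, that is,

$$ w_3 = x_3 - l_1^{(2)} w_1 - l_2^{(2)} w_2 $$

as illustrated in Figure 7.2(b). Using the conditions $w_3 \perp w_1$ and $w_3 \perp w_2$, we can easily
see that

$$ l_1^{(2)} = \frac{\langle x_3, w_1 \rangle}{\langle w_1, w_1 \rangle} \qquad l_2^{(2)} = \frac{\langle x_3, w_2 \rangle}{\langle w_2, w_2 \rangle} $$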
This approach leads to the following classical Gram-Schmidt algorithm: 


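• Define $w_1 = x_1$.
• For $2 \le m \le M$, compute

$$ w_m = x_m - \sum_{i=1}^{m-1} l_i^{(m-1)} w_i \tag{7.2.9} $$

where

$$ l_i^{(m-1)} \triangleq \frac{\langle x_m, w_i \rangle}{\langle w_i, w_i \rangle} \tag{7.2.10} $$

assuming that $\langle w_i, w_i \rangle \ne 0$.

From the derivation of the algorithm it should be clear that the sets $\{x_1, \ldots, x_m\}$ and
$\{w_1, \ldots, w_m\}$ are linearly equivalent for m = 1, 2, ..., M. Using (7.2.9), we obtain

$$ \mathbf{x}_m = \mathbf{L}_m \mathbf{w}_m \tag{7.2.11} $$

where

$$ \mathbf{L}_m \triangleq \begin{bmatrix} 1 & 0 & \cdots & 0 \\ l_1^{(1)} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_1^{(m-1)} & l_2^{(m-1)} & \cdots & 1 \end{bmatrix} \tag{7.2.12} $$

is a unit lower triangular matrix. Since, by construction, the components of $\mathbf{w}_m$ are uncorrelated,
its correlation matrix $\mathbf{D}_m$ is diagonal with elements $\xi_i = E\{|w_i|^2\}$. Using (7.2.11),
we obtain

$$ \mathbf{R}_m = E\{\mathbf{x}_m \mathbf{x}_m^H\} = \mathbf{L}_m E\{\mathbf{w}_m \mathbf{w}_m^H\} \mathbf{L}_m^H = \mathbf{L}_m \mathbf{D}_m \mathbf{L}_m^H \tag{7.2.13} $$

which is precisely the unique $LDL^H$ decomposition of the correlation matrix $\mathbf{R}_m$. Therefore,
the Gram-Schmidt orthogonalization of the data vector $\mathbf{x}_m$ provides an alternative
approach to obtain the $LDL^H$ decomposition of its correlation matrix $\mathbf{R}_m = E\{\mathbf{x}_m \mathbf{x}_m^H\}$.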
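The link between Gram-Schmidt orthogonalization and the triangular decomposition can be checked numerically. The following is a minimal sketch in Python, assuming a real-valued, positive definite correlation matrix supplied as a list of lists. The names `gram_schmidt_ldl` and `rebuild`, and the test matrix, are illustrative choices and do not come from the text. Because only second-order moments are available, the inner products $\langle x_m, w_i \rangle$ are evaluated from the entries of R rather than from data.

```python
from fractions import Fraction

def gram_schmidt_ldl(R):
    """Gram-Schmidt on second-order moments (real-valued case, an assumption).

    Returns (L, xi) such that R = L * diag(xi) * L^T, where row m of L holds
    the coefficients of (7.2.10) and xi[i] = <w_i, w_i>.
    """
    R = [[Fraction(v) for v in row] for row in R]
    M = len(R)
    L = [[Fraction(int(i == j)) for j in range(M)] for i in range(M)]
    xi = []
    for m in range(M):
        for i in range(m):
            # <x_m, w_i> = <x_m, x_i> - sum_{j<i} L[i][j] <x_m, w_j>,
            # with <x_m, w_j> = L[m][j] * xi[j] already known for j < i
            inner = R[m][i] - sum(L[i][j] * L[m][j] * xi[j] for j in range(i))
            L[m][i] = inner / xi[i]
        # <w_m, w_m> = <x_m, x_m> - sum_{j<m} L[m][j]^2 <w_j, w_j>
        xi.append(R[m][m] - sum(L[m][j] ** 2 * xi[j] for j in range(m)))
    return L, xi

def rebuild(L, xi):
    """Form L * diag(xi) * L^T to confirm the factorization."""
    M = len(xi)
    return [[sum(L[i][j] * xi[j] * L[k][j] for j in range(M)) for k in range(M)]
            for i in range(M)]

R = [[3, 2, 1], [2, 3, 2], [1, 2, 3]]
L, xi = gram_schmidt_ldl(R)
print(xi)                    # innovation powers <w_i, w_i>
print(rebuild(L, xi) == R)   # True when the factorization reproduces R
```

Exact rational arithmetic is used here only so that the reconstruction can be compared with R without rounding error; a floating-point version would follow the same steps.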


7.3 ORDER-RECURSIVE ALGORITHMS FOR OPTIMUM FIR FILTERS 


The key difference between a linear combiner and an FIR filter is the nature of the input 
data vector. The input data vector for FIR filters consists of consecutive samples from the 
same discrete-time stochastic process, that is, 


$$ \mathbf{x}_m(n) = [x(n) \;\; x(n-1) \;\; \cdots \;\; x(n-m+1)]^T \tag{7.3.1} $$


instead of samples from m different processes x;(n). This shift invariance of the input data 
vector allows for the development of simpler, order-recursive algorithms and structures 
for optimum FIR filtering and prediction compared to those for general linear estimation. 
Furthermore, the quest for order-recursive algorithms leads to a natural, elegant, and un- 
avoidable interconnection between optimum filtering and the BLP and FLP problems. 

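We start with the following upper and lower partitioning of the input data vector

$$ \mathbf{x}_{m+1}(n) = \begin{bmatrix} x(n) \\ x(n-1) \\ \vdots \\ x(n-m+1) \\ x(n-m) \end{bmatrix} = \begin{bmatrix} \mathbf{x}_m(n) \\ x(n-m) \end{bmatrix} = \begin{bmatrix} x(n) \\ \mathbf{x}_m(n-1) \end{bmatrix} \tag{7.3.2} $$

which shows that $\mathbf{x}_{m+1}^{\lceil m \rceil}(n)$ and $\mathbf{x}_{m+1}^{\lfloor m \rfloor}(n)$ are simply shifted versions (by one sample delay)
of the same vector $\mathbf{x}_m(n)$. The shift invariance of $\mathbf{x}_{m+1}(n)$ results in an analogous shift
invariance for the correlation matrix $\mathbf{R}_{m+1}(n) = E\{\mathbf{x}_{m+1}(n) \mathbf{x}_{m+1}^H(n)\}$. Indeed, we can
easily show that the upper-lower partitioning of the correlation matrix is

$$ \mathbf{R}_{m+1}(n) = \begin{bmatrix} \mathbf{R}_m(n) & \mathbf{r}_m^b(n) \\ \mathbf{r}_m^{bH}(n) & P_x(n-m) \end{bmatrix} \tag{7.3.3} $$

and the lower-upper partitioning is

$$ \mathbf{R}_{m+1}(n) = \begin{bmatrix} P_x(n) & \mathbf{r}_m^{fH}(n) \\ \mathbf{r}_m^f(n) & \mathbf{R}_m(n-1) \end{bmatrix} \tag{7.3.4} $$

where

$$ \mathbf{r}_m^b(n) = E\{\mathbf{x}_m(n) x^*(n-m)\} \tag{7.3.5} $$

$$ \mathbf{r}_m^f(n) = E\{\mathbf{x}_m(n-1) x^*(n)\} \tag{7.3.6} $$

$$ P_x(n) = E\{|x(n)|^2\} \tag{7.3.7} $$

We note that, in contrast to the general case (7.1.5) where the matrix $\mathbf{R}_m^f(n) = \mathbf{R}_{m+1}^{\lfloor m \rfloor}(n)$
is unrelated to $\mathbf{R}_m(n)$, here the matrix $\mathbf{R}_{m+1}^{\lfloor m \rfloor}(n) = \mathbf{R}_m(n-1)$. This is a by-product of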


the shift-invariance property of the input data vector and takes the development of order- 
recursive algorithms one step further. We begin our pursuit of an order-recursive algorithm 
with the development of a Levinson order recursion for the optimum FIR filter coefficients. 


7.3.1 Order-Recursive Computation of the Optimum Filter 


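Suppose that at time n we have already computed the optimum FIR filter $\mathbf{c}_m(n)$ specified
by

$$ \mathbf{c}_m(n) = \mathbf{R}_m^{-1}(n) \mathbf{d}_m(n) \tag{7.3.8} $$

and the MMSE is

$$ P_m^c(n) = P_y(n) - \mathbf{d}_m^H(n) \mathbf{c}_m(n) \tag{7.3.9} $$

where

$$ \mathbf{d}_m(n) = E\{\mathbf{x}_m(n) y^*(n)\} \tag{7.3.10} $$

We wish to compute the optimum filter

$$ \mathbf{c}_{m+1}(n) = \mathbf{R}_{m+1}^{-1}(n) \mathbf{d}_{m+1}(n) $$

by modifying $\mathbf{c}_m(n)$ using an order-recursive algorithm. From (7.3.3), we see that matrix
$\mathbf{R}_{m+1}(n)$ has the optimum nesting property. Using the upper partitioning in (7.3.2), we
obtain

$$ \mathbf{d}_{m+1}(n) = E\left\{ \begin{bmatrix} \mathbf{x}_m(n) \\ x(n-m) \end{bmatrix} y^*(n) \right\} = \begin{bmatrix} \mathbf{d}_m(n) \\ d_{m+1}(n) \end{bmatrix} \tag{7.3.11} $$

which shows that $\mathbf{d}_{m+1}(n)$ also has the optimum nesting property. Therefore, we can develop
a Levinson order recursion using the upper left matrix inversion by partitioning lemma

$$ \mathbf{R}_{m+1}^{-1}(n) = \begin{bmatrix} \mathbf{R}_m^{-1}(n) & \mathbf{0}_m \\ \mathbf{0}_m^H & 0 \end{bmatrix} + \frac{1}{P_m^b(n)} \begin{bmatrix} \mathbf{b}_m(n) \\ 1 \end{bmatrix} \begin{bmatrix} \mathbf{b}_m^H(n) & 1 \end{bmatrix} \tag{7.3.12} $$

where

$$ \mathbf{b}_m(n) = -\mathbf{R}_m^{-1}(n) \mathbf{r}_m^b(n) \tag{7.3.13} $$

is the optimum BLP, and

$$ P_m^b(n) = \frac{\det \mathbf{R}_{m+1}(n)}{\det \mathbf{R}_m(n)} = P_x(n-m) + \mathbf{r}_m^{bH}(n) \mathbf{b}_m(n) \tag{7.3.14} $$

is the corresponding MMSE. Equations (7.3.12) through (7.3.14) follow easily from (7.1.22),
(7.1.23), and (7.1.24). It is interesting to note that $\mathbf{b}_m(n)$ is the optimum estimator for the
additional observation x(n − m) used by the optimum filter $\mathbf{c}_{m+1}(n)$. Substituting (7.3.11)
and (7.3.12) into (7.3.8), we obtain

$$ \mathbf{c}_{m+1}(n) = \begin{bmatrix} \mathbf{c}_m(n) \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m(n) \\ 1 \end{bmatrix} k_m^c(n) \tag{7.3.15} $$

where

$$ k_m^c(n) \triangleq \frac{\beta_m^c(n)}{P_m^b(n)} \tag{7.3.16} $$

and

$$ \beta_m^c(n) \triangleq \mathbf{b}_m^H(n) \mathbf{d}_m(n) + d_{m+1}(n) \tag{7.3.17} $$

Thus, if we know the BLP $\mathbf{b}_m(n)$, we can determine $\mathbf{c}_{m+1}(n)$ by using the Levinson recursion
in (7.3.15).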


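Levinson recursion for the backward predictor. For the order recursion in (7.3.15) to
be useful, we need an order recursion for the BLP $\mathbf{b}_m(n)$. This is possible if the linear
systems

$$ \begin{aligned} \mathbf{R}_m(n) \mathbf{b}_m(n) &= -\mathbf{r}_m^b(n) \\ \mathbf{R}_{m+1}(n) \mathbf{b}_{m+1}(n) &= -\mathbf{r}_{m+1}^b(n) \end{aligned} \tag{7.3.18} $$

are nested. Since the matrices are nested [see (7.3.3)], we check whether the right-hand
side vectors are nested. We can easily see that no optimum nesting is possible if we use the
upper partitioning in (7.3.2). However, if we use the lower-upper partitioning, we obtain

$$ \mathbf{r}_{m+1}^b(n) = E\left\{ \begin{bmatrix} x(n) \\ \mathbf{x}_m(n-1) \end{bmatrix} x^*(n-m-1) \right\} = \begin{bmatrix} r_{m+1}^b(n) \\ \mathbf{r}_m^b(n-1) \end{bmatrix} \tag{7.3.19} $$

which provides a partitioning that includes the wanted vector $\mathbf{r}_m^b(n)$ delayed by one sample
as a result of the shift invariance of $\mathbf{x}_m(n)$. To explore this partitioning, we use the lower-upper
corner matrix inversion by partitioning lemma

$$ \mathbf{R}_{m+1}^{-1}(n) = \begin{bmatrix} 0 & \mathbf{0}_m^H \\ \mathbf{0}_m & \mathbf{R}_m^{-1}(n-1) \end{bmatrix} + \frac{1}{P_m^f(n)} \begin{bmatrix} 1 \\ \mathbf{a}_m(n) \end{bmatrix} \begin{bmatrix} 1 & \mathbf{a}_m^H(n) \end{bmatrix} \tag{7.3.20} $$

where

$$ \mathbf{a}_m(n) = -\mathbf{R}_m^{-1}(n-1) \mathbf{r}_m^f(n) \tag{7.3.21} $$

is the optimum FLP and

$$ P_m^f(n) = \frac{\det \mathbf{R}_{m+1}(n)}{\det \mathbf{R}_m(n-1)} = P_x(n) + \mathbf{r}_m^{fH}(n) \mathbf{a}_m(n) \tag{7.3.22} $$

is the forward linear prediction MMSE. Equations (7.3.20) through (7.3.22) follow easily
from (7.1.26) through (7.1.28). Substituting (7.3.20) and (7.3.19) into

$$ \mathbf{b}_{m+1}(n) = -\mathbf{R}_{m+1}^{-1}(n) \mathbf{r}_{m+1}^b(n) $$

we obtain the recursion

$$ \mathbf{b}_{m+1}(n) = \begin{bmatrix} 0 \\ \mathbf{b}_m(n-1) \end{bmatrix} + \begin{bmatrix} 1 \\ \mathbf{a}_m(n) \end{bmatrix} k_m^b(n) \tag{7.3.23} $$

where

$$ k_m^b(n) \triangleq -\frac{\beta_m^b(n)}{P_m^f(n)} \tag{7.3.24} $$

and

$$ \beta_m^b(n) \triangleq r_{m+1}^b(n) + \mathbf{a}_m^H(n) \mathbf{r}_m^b(n-1) \tag{7.3.25} $$

To proceed with the development of the order-recursive algorithm, we clearly need an order
recursion for the optimum FLP $\mathbf{a}_m(n)$.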


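Levinson recursion for the forward predictor. Following a similar procedure for the
Levinson recursion of the BLP, we can derive the Levinson recursion for the FLP. If we use
the upper-lower partitioning in (7.3.2), we obtain

$$ \mathbf{r}_{m+1}^f(n) = E\{\mathbf{x}_{m+1}(n-1) x^*(n)\} = \begin{bmatrix} \mathbf{r}_m^f(n) \\ r_{m+1}^f(n) \end{bmatrix} \tag{7.3.26} $$

which in conjunction with (7.3.12) and (7.3.21) leads to the following order recursion

$$ \mathbf{a}_{m+1}(n) = \begin{bmatrix} \mathbf{a}_m(n) \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m(n-1) \\ 1 \end{bmatrix} k_m^f(n) \tag{7.3.27} $$

where

$$ k_m^f(n) \triangleq -\frac{\beta_m^f(n)}{P_m^b(n-1)} \tag{7.3.28} $$

and

$$ \beta_m^f(n) \triangleq \mathbf{b}_m^H(n-1) \mathbf{r}_m^f(n) + r_{m+1}^f(n) \tag{7.3.29} $$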
Is an order-recursive algorithm feasible? For m = 1, we have a scalar equation
$r_{11}(n) c_1^{(1)}(n) = d_1(n)$ whose solution is $c_1^{(1)}(n) = d_1(n)/r_{11}(n)$. Using the Levinson
order recursions for m = 1, 2, ..., M − 1, we can find $\mathbf{c}_M(n)$ if the quantities $\mathbf{b}_m(n-1)$
and $P_m^b(n-1)$, $1 \le m < M$, required by (7.3.27) and (7.3.28) are known. The lack of
this information prevents the development of a complete order-recursive algorithm for the
solution of the normal equations for optimum FIR filtering or prediction. The need for time
updates arises because each order update requires both the upper left corner and the lower
right corner partitionings

$$ \mathbf{R}_{m+1}(n) = \begin{bmatrix} \mathbf{R}_m(n) & \times \\ \times & \times \end{bmatrix} = \begin{bmatrix} \times & \times \\ \times & \mathbf{R}_m(n-1) \end{bmatrix} $$

of matrix $\mathbf{R}_{m+1}$. The presence of $\mathbf{R}_m(n-1)$, which is a result of the nonstationarity of
the input signal, creates the need for a time updating of $\mathbf{b}_m(n)$. This is possible only for
certain types of nonstationarity that can be described by simple relations between $\mathbf{R}_m(n)$ and
$\mathbf{R}_m(n-1)$. The simplest case occurs for stationary processes where $\mathbf{R}_m(n) = \mathbf{R}_m(n-1) = \mathbf{R}_m$.
Another very useful case occurs for nonstationary processes generated by linear state-space
models, which results in the Kalman filtering algorithm (see Section 7.8).
Partial correlation interpretation. The partial correlation between y(n) and x(n − m),
after the influence of the intermediate samples x(n), x(n − 1), ..., x(n − m + 1) has been
removed, is

$$ E\{e_m^b(n) e_m^*(n)\} = \mathbf{b}_m^H(n) \mathbf{d}_m(n) + d_{m+1}(n) = \beta_m^c(n) \tag{7.3.30} $$

which is obtained by working as in the derivation of (7.2.6). It can be shown, following
a procedure similar to that leading to (7.2.8), that the $k_m(n)$ parameters in the Levinson
recursions can be obtained from

$$ \begin{aligned} \mathbf{R}_m(n) &= \mathbf{L}_m(n) \mathbf{D}_m(n) \mathbf{L}_m^H(n) \\ \mathbf{L}_m(n) \mathbf{D}_m(n) \mathbf{k}_m^c(n) &= \mathbf{d}_m(n) \\ \mathbf{L}_m(n) \mathbf{D}_m(n) \mathbf{k}_m^b(n) &= \mathbf{r}_m^b(n) \\ \mathbf{L}_m(n-1) \mathbf{D}_m(n-1) \mathbf{k}_m^f(n) &= \mathbf{r}_m^f(n) \end{aligned} \tag{7.3.31} $$

that is, as a by-product of the $LDL^H$ decomposition.

Similarly, if we consider the sequence x(n), x(n − 1), ..., x(n − m), x(n − m − 1),
we can show that the partial correlation between x(n) and x(n − m − 1) is given by (see
Problem 7.6)

$$ E\{e_m^b(n-1) e_m^{f*}(n)\} = r_{m+1}^f(n) + \mathbf{b}_m^H(n-1) \mathbf{r}_m^f(n) = \beta_m^f(n) \tag{7.3.32} $$

Because $r_{m+1}^f(n) = r_{m+1}^{b*}(n)$, we have the following simplification

$$ \begin{aligned} \beta_m^f(n) &= \mathbf{b}_m^H(n-1) \mathbf{R}_m(n-1) \mathbf{R}_m^{-1}(n-1) \mathbf{r}_m^f(n) + r_{m+1}^{b*}(n) \\ &= \mathbf{r}_m^{bH}(n-1) \mathbf{a}_m(n) + r_{m+1}^{b*}(n) = \beta_m^{b*}(n) \end{aligned} $$

which is known as Burg's lemma (Burg 1975). In order to simplify the notation, we define

$$ \beta_m(n) \triangleq \beta_m^f(n) = \beta_m^{b*}(n) \tag{7.3.33} $$

Using (7.3.24), (7.3.28), and (7.3.30), we obtain

$$ k_m^f(n) k_m^b(n) = \frac{|\beta_m(n)|^2}{P_m^f(n) P_m^b(n-1)} = \frac{|E\{e_m^b(n-1) e_m^{f*}(n)\}|^2}{E\{|e_m^f(n)|^2\} E\{|e_m^b(n-1)|^2\}} \tag{7.3.34} $$

which implies that

$$ 0 \le k_m^f(n) k_m^b(n) \le 1 \tag{7.3.35} $$

because the last term in (7.3.34) is the squared magnitude of the correlation coefficient of
the random variables $e_m^f(n)$ and $e_m^b(n-1)$.


Order recursions for the MMSEs. Using the Levinson order recursions, we can obtain
order-recursive formulas for the computation of $P_m^f(n)$, $P_m^b(n)$, and $P_m^c(n)$. Indeed,
using (7.3.26), (7.3.27), and (7.3.29), we have

$$ \begin{aligned} P_{m+1}^f(n) &= P_x(n) + \mathbf{r}_{m+1}^{fH}(n) \mathbf{a}_{m+1}(n) \\ &= P_x(n) + \begin{bmatrix} \mathbf{r}_m^{fH}(n) & r_{m+1}^{f*}(n) \end{bmatrix} \left( \begin{bmatrix} \mathbf{a}_m(n) \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m(n-1) \\ 1 \end{bmatrix} k_m^f(n) \right) \\ &= P_x(n) + \mathbf{r}_m^{fH}(n) \mathbf{a}_m(n) + [\mathbf{r}_m^{fH}(n) \mathbf{b}_m(n-1) + r_{m+1}^{f*}(n)] k_m^f(n) \end{aligned} $$

or

$$ P_{m+1}^f(n) = P_m^f(n) + \beta_m^*(n) k_m^f(n) = P_m^f(n) - \frac{|\beta_m(n)|^2}{P_m^b(n-1)} \tag{7.3.36} $$

If we work in a similar manner, we obtain

$$ P_{m+1}^b(n) = P_m^b(n-1) + \beta_m(n) k_m^b(n) = P_m^b(n-1) - \frac{|\beta_m(n)|^2}{P_m^f(n)} \tag{7.3.37} $$

and

$$ P_{m+1}^c(n) = P_m^c(n) - \beta_m^c(n) k_m^{c*}(n) = P_m^c(n) - \frac{|\beta_m^c(n)|^2}{P_m^b(n)} \tag{7.3.38} $$

If the subtrahends in the previous recursions are nonzero, increasing the order of the filter
always improves the estimates, that is, $P_{m+1}^c(n) < P_m^c(n)$. Also, the conditions $P_m^f(n) \ne 0$
and $P_m^b(n) \ne 0$ are critical for the invertibility of $\mathbf{R}_m(n)$ and the computation of the
optimum filters. The above relations are special cases of (7.1.45) and can be derived from
the $LDL^H$ decomposition (see Problem 7.7). The presence of vectors with mixed optimum
nesting (upper-lower and lower-upper) in the definitions of $\beta_m(n)$ and $\beta_m^c(n)$ does not lead
to similar order recursions for these quantities. However, for stationary processes we can
break the dot products in (7.3.17) and (7.3.25) into scalar recursions, using an algorithm
first introduced by Schur (see Section 7.6).


7.3.2 Lattice-Ladder Structure 


We saw that the shift invariance of the input data vector made it possible to develop the 
Levinson recursions for the BLP and the FLP. We next show that these recursions can 
be used to simplify the triangular order-recursive estimation structure of Figure 7.1 by 
reducing it to a more efficient (linear instead of triangular), lattice-ladder filter structure 
that simultaneously provides the FLP, BLP, and FIR filtering estimates. 
The computation of the estimation errors using direct-form structures is based on the
following equations:

$$ \begin{aligned} e_m^f(n) &= x(n) + \mathbf{a}_m^H(n) \mathbf{x}_m(n-1) \\ e_m^b(n) &= x(n-m) + \mathbf{b}_m^H(n) \mathbf{x}_m(n) \\ e_m(n) &= y(n) - \mathbf{c}_m^H(n) \mathbf{x}_m(n) \end{aligned} \tag{7.3.39} $$

Using (7.3.2), (7.3.27), and (7.3.39), we obtain

$$ \begin{aligned} e_{m+1}^f(n) &= x(n) + \left( \begin{bmatrix} \mathbf{a}_m(n) \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m(n-1) \\ 1 \end{bmatrix} k_m^f(n) \right)^H \begin{bmatrix} \mathbf{x}_m(n-1) \\ x(n-1-m) \end{bmatrix} \\ &= x(n) + \mathbf{a}_m^H(n) \mathbf{x}_m(n-1) + [\mathbf{b}_m^H(n-1) \mathbf{x}_m(n-1) + x(n-1-m)] k_m^{f*}(n) \end{aligned} $$

or

$$ e_{m+1}^f(n) = e_m^f(n) + k_m^{f*}(n) e_m^b(n-1) \tag{7.3.40} $$

In a similar manner, we obtain

$$ e_{m+1}^b(n) = e_m^b(n-1) + k_m^{b*}(n) e_m^f(n) \tag{7.3.41} $$

using (7.3.2), (7.3.23), and (7.3.39). Relations (7.3.40) and (7.3.41) are executed for m =
0, 1, ..., M − 2, with $e_0^f(n) = e_0^b(n) = x(n)$, and constitute a lattice filter that implements
the FLP and the BLP.

Using (7.3.2), (7.3.15), and (7.3.39), we can show that the optimum filtering error can
be computed by

$$ e_{m+1}(n) = e_m(n) - k_m^{c*}(n) e_m^b(n) \tag{7.3.42} $$

which is executed for m = 0, 1, ..., M − 1, with $e_0(n) = y(n)$. The last equation provides
the ladder part, which is coupled with the lattice predictor to implement the optimum filter.
The result is the time-varying lattice-ladder structure shown in Figure 7.3. Notice that a new
set of lattice-ladder coefficients has to be computed for every n, using $\mathbf{R}_m(n)$ and $\mathbf{d}_m(n)$. The
parameters of the lattice-ladder structure can be obtained by $LDL^H$ decomposition using
(7.3.31). Suppose now that we know $P_0^f(n) = P_0^b(n) = P_x(n)$, $P_0^b(n-1)$, $P_0^c(n) = P_y(n)$,
$\{\beta_m(n)\}$, and $\{\beta_m^c(n)\}$. Then we can determine $P_m^f(n)$, $P_m^b(n)$, and $P_m^c(n)$ for all
m, using (7.3.36) through (7.3.38), and all filter coefficients, using (7.3.16), (7.3.24), and
(7.3.28). However, to obtain a completely time-recursive updating algorithm, we need time
updatings for $\beta_m(n)$ and $\beta_m^c(n)$. As we will see later, this is possible if R(n) and d(n) are
fixed or are defined by known time-updating formulas.

FIGURE 7.3
Lattice-ladder structure for FIR optimum filtering and prediction.

We recall that the BLP error vector $\mathbf{e}_{m+1}^b(n)$ is the innovations vector of the data
$\mathbf{x}_{m+1}(n)$. Notice that as a result of the shift invariance of the input data vector, the triangular
decorrelator of the general linear estimator (see Figure 7.1) is replaced by a simpler, "linear"
lattice structure. For stationary processes, the lattice-ladder filter is time-invariant, and we
need to compute only one set of coefficients that can be used for all signals with the same
R and d (see Section 7.5).


7.3.3 Simplifications for Stationary Stochastic Processes 


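When x(n) and y(n) are jointly wide-sense stationary (WSS), the optimum estimators are
time-invariant and we have the following simplifications:

• All quantities are independent of n; thus we do not need time recursions for the BLP
parameters.

• $\mathbf{b}_m = \mathbf{J} \mathbf{a}_m^*$ (see Section 6.5.4), and thus we do not need the Levinson recursion for the
BLP $\mathbf{b}_m$.

Both simplifications are a consequence of the Toeplitz structure of the correlation matrix
$\mathbf{R}_m$. Indeed, comparing the partitionings

$$ \mathbf{R}_{m+1} = \begin{bmatrix} \mathbf{R}_m & \mathbf{J} \mathbf{r}_m \\ \mathbf{r}_m^H \mathbf{J} & r(0) \end{bmatrix} = \begin{bmatrix} r(0) & \mathbf{r}_m^T \\ \mathbf{r}_m^* & \mathbf{R}_m \end{bmatrix} \tag{7.3.43} $$

where

$$ \mathbf{r}_m \triangleq [r(1) \;\; r(2) \;\; \cdots \;\; r(m)]^T \tag{7.3.44} $$

with (7.3.3) and (7.3.4), we have

$$ \begin{aligned} \mathbf{R}_m(n) &= \mathbf{R}_m(n-1) = \mathbf{R}_m \\ \mathbf{r}_m^f(n) &= \mathbf{r}_m^* \\ \mathbf{r}_m^b(n) &= \mathbf{J} \mathbf{r}_m \end{aligned} \tag{7.3.45} $$

which can be used to simplify the order recursions derived for nonstationary processes.
Indeed, we can easily show that

$$ \mathbf{a}_{m+1} = \begin{bmatrix} \mathbf{a}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} k_m \tag{7.3.46} $$

where

$$ \mathbf{b}_m = \mathbf{J} \mathbf{a}_m^* \tag{7.3.47} $$

$$ k_m \triangleq k_m^f = k_m^{b*} = -\frac{\beta_m}{P_m} \tag{7.3.48} $$

$$ \beta_m \triangleq \beta_m^f = \beta_m^{b*} = \mathbf{b}_m^H \mathbf{r}_m^* + r^*(m+1) \tag{7.3.49} $$

$$ P_m \triangleq P_m^b = P_m^f = P_{m-1} + \beta_{m-1}^* k_{m-1} = P_{m-1} + \beta_{m-1} k_{m-1}^* \tag{7.3.50} $$

This recursion provides a complete order-recursive algorithm for the computation of the
FLP $\mathbf{a}_m$ for $1 \le m \le M$ from the autocorrelation sequence r(l) for $0 \le l \le M$.

The optimum filters $\mathbf{c}_m$ for $1 \le m \le M$ can be obtained from the quantities $\mathbf{a}_m$ and
$P_m$ for $1 \le m \le M - 1$ and $\mathbf{d}_M$, using the following Levinson recursion

$$ \mathbf{c}_{m+1} = \begin{bmatrix} \mathbf{c}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{J} \mathbf{a}_m^* \\ 1 \end{bmatrix} k_m^c \tag{7.3.51} $$

where

$$ k_m^c \triangleq \frac{\beta_m^c}{P_m} \tag{7.3.52} $$

and

$$ \beta_m^c \triangleq \mathbf{b}_m^H \mathbf{d}_m + d_{m+1} \tag{7.3.53} $$

The MMSE $P_m^c$ is then given by

$$ P_m^c = P_{m-1}^c - \beta_{m-1}^c k_{m-1}^{c*} \tag{7.3.54} $$

and although it is not required by the algorithm, $P_m^c$ is useful for selecting the order of the
optimum filter. Both algorithms are discussed in greater detail in Section 7.4.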


7.3.4 Algorithms Based on the $UDU^H$ Decomposition


Hermitian positive definite matrices can also be factorized as

$$ \mathbf{R} = \mathbf{U} \mathbf{D} \mathbf{U}^H \tag{7.3.55} $$

where U is a unit upper triangular matrix and D is a diagonal matrix with positive elements
$\xi_i$, using the function [U,D]=udut(R) (see Problem 7.8). Using the decomposition (7.3.55),
we can obtain the solution of the normal equations by solving the triangular systems, first
for k

$$ (\mathbf{U} \mathbf{D}) \mathbf{k} \triangleq \mathbf{d} \tag{7.3.56} $$

and then for c

$$ \mathbf{U}^H \mathbf{c} = \mathbf{k} \tag{7.3.57} $$

by backward and forward substitution, respectively. The MMSE estimate can be computed
by

$$ \hat{y} = \mathbf{c}^H \mathbf{x} = \mathbf{k}^H \mathbf{w} \tag{7.3.58} $$

where

$$ \mathbf{w} \triangleq \mathbf{U}^{-1} \mathbf{x} \tag{7.3.59} $$

is an innovations vector for the data vector x. It can be shown that the rows of $\mathbf{A} \triangleq \mathbf{U}^{-1}$ are
the linear MMSE estimator of $x_m$ based on $[x_{m+1} \;\; x_{m+2} \;\; \cdots \;\; x_M]^T$. Furthermore, the $UDU^H$
factorization (7.3.55) can be obtained by the Gram-Schmidt algorithm, starting with $x_M$
and proceeding "backward" to $x_1$ (see Problem 7.9). The various triangular decompositions
of the correlation matrix R are summarized in Table 7.1.

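If we define the reversed vectors $\tilde{\mathbf{x}} \triangleq \mathbf{J} \mathbf{x}$ and $\tilde{\mathbf{w}} \triangleq \mathbf{J} \mathbf{w}$, we obtain

$$ \tilde{\mathbf{x}} = \mathbf{J} \mathbf{x} = \mathbf{J} \mathbf{L} \mathbf{J} \mathbf{J} \mathbf{w} = \mathbf{J} \mathbf{L} \mathbf{J} \tilde{\mathbf{w}} \triangleq \tilde{\mathbf{U}} \tilde{\mathbf{w}} \tag{7.3.60} $$

because $\mathbf{J}^2 = \mathbf{I}$ and $\tilde{\mathbf{U}} = \mathbf{J} \mathbf{L} \mathbf{J}$ is upper triangular. The correlation matrix of $\tilde{\mathbf{x}}$ is

$$ \tilde{\mathbf{R}} = E\{\tilde{\mathbf{x}} \tilde{\mathbf{x}}^H\} = \tilde{\mathbf{U}} \tilde{\mathbf{D}} \tilde{\mathbf{U}}^H \tag{7.3.61} $$

where $\tilde{\mathbf{D}} \triangleq E\{\tilde{\mathbf{w}} \tilde{\mathbf{w}}^H\}$ is the diagonal correlation matrix of $\tilde{\mathbf{w}}$. Equation (7.3.61) provides
the $UDU^H$ decomposition of $\tilde{\mathbf{R}}$.

TABLE 7.1
Summary of the triangular decompositions of the correlation matrix.

Matrix      $LDL^H$-type decomposition    $UDU^H$-type decomposition    Definition
$R$         $L D L^H$                     $U D U^H$                     $A = U^{-1}$
$R^{-1}$    $A^H D^{-1} A$                $B^H D^{-1} B$                $B = L^{-1}$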

A natural question arising at this point is whether we can develop order-recursive
algorithms and structures for optimum filtering using the $UDU^H$ instead of the $LDL^H$
decomposition of the correlation matrix. The $UDU^H$ decomposition is coupled to a partitioning
of $\mathbf{R}_{m+1}(n)$ starting at the lower right corner and moving to the upper left corner
that provides the following sequence of submatrices

$$ \mathbf{R}_1(n-m) \to \mathbf{R}_2(n-m+1) \to \cdots \to \mathbf{R}_{m+1}(n) \tag{7.3.62} $$

which, in turn, are related to the FLPs

$$ \mathbf{a}_1(n-m+1) \to \mathbf{a}_2(n-m+2) \to \cdots \to \mathbf{a}_m(n) \tag{7.3.63} $$

and the FLP errors

$$ e_1^f(n-m+1) \to e_2^f(n-m+2) \to \cdots \to e_m^f(n) \tag{7.3.64} $$

If we define the FLP error vector

$$ \mathbf{e}_{m+1}^f(n) \triangleq [e_m^f(n) \;\; e_{m-1}^f(n-1) \;\; \cdots \;\; e_0^f(n-m)]^T \tag{7.3.65} $$

we see that

$$ \mathbf{e}_{m+1}^f(n) = \mathbf{A}_{m+1}(n) \mathbf{x}_{m+1}(n) \tag{7.3.66} $$

where

$$ \mathbf{A}_{m+1}(n) \triangleq \begin{bmatrix} 1 & a_1^{(m)*}(n) & a_2^{(m)*}(n) & \cdots & a_m^{(m)*}(n) \\ 0 & 1 & a_1^{(m-1)*}(n-1) & \cdots & a_{m-1}^{(m-1)*}(n-1) \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & a_1^{(1)*}(n-m+1) \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix} \tag{7.3.67} $$

The elements of the vector $\mathbf{e}_{m+1}^f(n)$ are uncorrelated, and the $LDL^H$ decomposition of the
inverse correlation matrix (see Problem 7.10) is given by

$$ \mathbf{R}_{m+1}^{-1}(n) = \mathbf{A}_{m+1}^H(n) [\mathbf{D}_{m+1}^f(n)]^{-1} \mathbf{A}_{m+1}(n) \tag{7.3.68} $$

where $\mathbf{D}_{m+1}^f(n)$ is the correlation matrix of $\mathbf{e}_{m+1}^f(n)$. Using $\mathbf{e}_{m+1}^f(n)$ as an orthogonal basis
instead of $\mathbf{e}_{m+1}^b(n)$ results in a complicated lattice structure because of the additional delay
elements required for the forward prediction errors. Thus, the $LDL^H$ decomposition is the
method of choice in practical applications for linear MMSE estimation.


7.4 ALGORITHMS OF LEVINSON AND LEVINSON-DURBIN 


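Since the correlation matrix of a stationary, stochastic process is Toeplitz, we can exploit
its special structure to develop efficient, order-recursive algorithms for the linear system
solution, matrix triangularization, and matrix inversion. Although we develop such algorithms
in the context of optimum FIR filtering and prediction, the results apply to other
applications involving Toeplitz matrices (Golub and van Loan 1996).

Suppose that we know the optimum filter $\mathbf{c}_m$, given by

$$ \mathbf{c}_m = \mathbf{R}_m^{-1} \mathbf{d}_m \tag{7.4.1} $$

and we wish to use it to compute the optimum filter $\mathbf{c}_{m+1}$

$$ \mathbf{c}_{m+1} = \mathbf{R}_{m+1}^{-1} \mathbf{d}_{m+1} \tag{7.4.2} $$

We first notice that the matrix $\mathbf{R}_{m+1}$ and the vector $\mathbf{d}_{m+1}$ can be partitioned as follows

$$ \mathbf{R}_{m+1} = \begin{bmatrix} r(0) & \cdots & r(m-1) & r(m) \\ \vdots & \ddots & \vdots & \vdots \\ r^*(m-1) & \cdots & r(0) & r(1) \\ r^*(m) & \cdots & r^*(1) & r(0) \end{bmatrix} = \begin{bmatrix} \mathbf{R}_m & \mathbf{J} \mathbf{r}_m \\ \mathbf{r}_m^H \mathbf{J} & r(0) \end{bmatrix} \tag{7.4.3} $$

$$ \mathbf{d}_{m+1} = \begin{bmatrix} \mathbf{d}_m \\ d_{m+1} \end{bmatrix} \tag{7.4.4} $$

which shows that both quantities have the optimum nesting property, that is, $\mathbf{R}_{m+1}^{\lceil m \rceil} = \mathbf{R}_m$
and $\mathbf{d}_{m+1}^{\lceil m \rceil} = \mathbf{d}_m$.

Using the matrix inversion by partitioning lemma (7.1.24), we obtain

$$ \mathbf{R}_{m+1}^{-1} = \begin{bmatrix} \mathbf{R}_m^{-1} & \mathbf{0}_m \\ \mathbf{0}_m^H & 0 \end{bmatrix} + \frac{1}{P_m^b} \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} \begin{bmatrix} \mathbf{b}_m^H & 1 \end{bmatrix} \tag{7.4.5} $$

where

$$ \mathbf{b}_m = -\mathbf{R}_m^{-1} \mathbf{J} \mathbf{r}_m \tag{7.4.6} $$

and

$$ P_m^b = r(0) + \mathbf{r}_m^H \mathbf{J} \mathbf{b}_m \tag{7.4.7} $$

Substitution of (7.4.4) and (7.4.5) into (7.4.2) gives

$$ \mathbf{c}_{m+1} = \begin{bmatrix} \mathbf{c}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} k_m^c \tag{7.4.8} $$

where

$$ k_m^c \triangleq \frac{\beta_m^c}{P_m^b} \tag{7.4.9} $$

and

$$ \beta_m^c \triangleq \mathbf{b}_m^H \mathbf{d}_m + d_{m+1} = -\mathbf{c}_m^T \mathbf{J} \mathbf{r}_m^* + d_{m+1} \tag{7.4.10} $$

Equations (7.4.8) through (7.4.10) constitute a Levinson recursion for the optimum filter
and have been obtained without making use of the Toeplitz structure of $\mathbf{R}_{m+1}$.

The development of a complete order-recursive algorithm is made possible by exploiting
the Toeplitz structure. Indeed, when the correlation matrix $\mathbf{R}_m$ is Toeplitz, we have

$$ \mathbf{b}_m = \mathbf{J} \mathbf{a}_m^* \tag{7.4.11} $$

and

$$ P_m \triangleq P_m^b = P_m^f \tag{7.4.12} $$

as we recall from Section 6.5. Since we can determine $\mathbf{b}_m$ from $\mathbf{a}_m$, we need to perform
only one Levinson recursion, either for $\mathbf{b}_m$ or for $\mathbf{a}_m$.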
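To avoid the use of the lower right corner partitioning, we develop an order recursion
for the FLP $\mathbf{a}_m$. Indeed, to compute $\mathbf{a}_{m+1}$ from $\mathbf{a}_m$, recall that

$$ \mathbf{a}_{m+1} = -\mathbf{R}_{m+1}^{-1} \mathbf{r}_{m+1}^* \tag{7.4.13} $$

which, when combined with (7.4.5) and

$$ \mathbf{r}_{m+1} = \begin{bmatrix} \mathbf{r}_m \\ r(m+1) \end{bmatrix} \tag{7.4.14} $$

leads to the Levinson recursion

$$ \mathbf{a}_{m+1} = \begin{bmatrix} \mathbf{a}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} k_m \tag{7.4.15} $$

where

$$ k_m \triangleq -\frac{\beta_m}{P_m} \tag{7.4.16} $$

$$ \beta_m \triangleq \mathbf{b}_m^H \mathbf{r}_m^* + r^*(m+1) = \mathbf{a}_m^T \mathbf{J} \mathbf{r}_m^* + r^*(m+1) \tag{7.4.17} $$

and

$$ P_m = r(0) + \mathbf{r}_m^H \mathbf{a}_m^* = r(0) + \mathbf{a}_m^T \mathbf{r}_m \tag{7.4.18} $$

Also, using (7.1.46) and (7.2.6), we can show that

$$ P_m = \frac{\det \mathbf{R}_{m+1}}{\det \mathbf{R}_m} \qquad \text{and} \qquad \det \mathbf{R}_m = \prod_{i=0}^{m-1} P_i \quad \text{with } P_0 = r(0) \tag{7.4.19} $$

which emphasizes the importance of $P_m$ for the invertibility of the autocorrelation matrix.
The MMSE $P_m$ for either the forward or the backward predictor of order m can be computed
recursively as follows:

$$ \begin{aligned} P_{m+1} &= r(0) + \begin{bmatrix} \mathbf{r}_m^H & r^*(m+1) \end{bmatrix} \left( \begin{bmatrix} \mathbf{a}_m^* \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m^* \\ 1 \end{bmatrix} k_m^* \right) \\ &= r(0) + \mathbf{r}_m^H \mathbf{a}_m^* + [\mathbf{r}_m^H \mathbf{b}_m^* + r^*(m+1)] k_m^* \end{aligned} \tag{7.4.20} $$

or

$$ P_{m+1} = P_m + \beta_m k_m^* = P_m + \beta_m^* k_m = P_m (1 - |k_m|^2) \tag{7.4.21} $$

The following recursive formula for the computation of the MMSE

$$ P_{m+1}^c = P_m^c - \beta_m^c k_m^{c*} \tag{7.4.22} $$

can be found by using (7.4.8).

Therefore, the algorithm of Levinson consists of two parts: a set of recursions that
compute the optimum FLP or BLP and a set of recursions that use this information to
compute the optimum filter. The part that computes the linear predictors is known as the
Levinson-Durbin algorithm and was pointed out by Durbin (1960). From a linear system
solution point of view, the algorithm of Levinson solves a Hermitian Toeplitz system with
arbitrary right-hand side vector d; the Levinson-Durbin algorithm deals with the special
case $\mathbf{d} = \mathbf{r}^*$ or $\mathbf{J} \mathbf{r}$. Additional interpretations are discussed in Section 7.7.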


Algorithm of Levinson-Durbin 

The algorithm of Levinson-Durbin, which takes as input the autocorrelation sequence
r(0), r(1), ..., r(M) and computes the quantities $\mathbf{a}_m$, $P_m$, and $k_{m-1}$ for m = 1, 2, ..., M,
is illustrated in the following examples.

EXAMPLE 7.4.1. Determine the FLP $\mathbf{a}_2 = [a_1^{(2)} \;\; a_2^{(2)}]^T$ and the MMSE $P_2$ from the autocorrelation
values r(0), r(1), and r(2).

Solution. To initialize the algorithm, we determine the first-order predictor by solving the
normal equations $r(0) a_1^{(1)} = -r^*(1)$. Indeed, we have

$$ a_1^{(1)} = -\frac{r^*(1)}{r(0)} = k_0 = -\frac{\beta_0}{P_0} $$

which implies that

$$ \beta_0 = r^*(1) \qquad P_0 = r(0) $$

To update to order 2, we need $k_1$ and hence $\beta_1$ and $P_1$, which can be obtained by

$$ \beta_1 = a_1^{(1)} r^*(1) + r^*(2) = \frac{r(0) r^*(2) - [r^*(1)]^2}{r(0)} $$

$$ P_1 = P_0 + \beta_0 k_0^* = \frac{r^2(0) - |r(1)|^2}{r(0)} $$

as

$$ k_1 = -\frac{\beta_1}{P_1} = \frac{[r^*(1)]^2 - r(0) r^*(2)}{r^2(0) - |r(1)|^2} $$

Therefore, using Levinson's recursion, we obtain

$$ a_1^{(2)} = a_1^{(1)} + a_1^{(1)*} k_1 = \frac{r(1) r^*(2) - r(0) r^*(1)}{r^2(0) - |r(1)|^2} $$

and

$$ a_2^{(2)} = k_1 $$

which agree with the results obtained in Example 6.5.1. The resulting MMSE can be found by
using $P_2 = P_1 + \beta_1 k_1^*$.


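EXAMPLE 7.4.2. Use the Levinson-Durbin algorithm to compute the third-order forward predictor
for a signal x(n) with autocorrelation sequence r(0) = 3, r(1) = 2, r(2) = 1, and r(3) = 1/2.

Solution. To initialize the algorithm, we notice that the first-order predictor is given by
$r(0) a_1^{(1)} = -r(1)$ and that for m = 0, (7.4.15) gives $a_1^{(1)} = k_0$. Hence, we have

$$ a_1^{(1)} = k_0 = -\frac{r(1)}{r(0)} = -\frac{2}{3} = -\frac{\beta_0}{P_0} $$

which implies

$$ P_0 = r(0) = 3 \qquad \beta_0 = r(1) = 2 $$

To compute $\mathbf{a}_2$ by (7.4.15), we need $a_1^{(1)}$, $b_1^{(1)} = a_1^{(1)}$, and $k_1 = -\beta_1/P_1$. From (7.4.21), we
have

$$ P_1 = P_0 + \beta_0 k_0 = 3 + 2\left(-\tfrac{2}{3}\right) = \tfrac{5}{3} $$

and from (7.4.17)

$$ \beta_1 = \mathbf{r}_1^T \mathbf{J} \mathbf{a}_1 + r(2) = 2\left(-\tfrac{2}{3}\right) + 1 = -\tfrac{1}{3} $$

Hence,

$$ k_1 = -\frac{\beta_1}{P_1} = \frac{1/3}{5/3} = \frac{1}{5} $$

$$ \mathbf{a}_2 = \begin{bmatrix} -\tfrac{2}{3} \\ 0 \end{bmatrix} + \begin{bmatrix} -\tfrac{2}{3} \\ 1 \end{bmatrix} \frac{1}{5} = \begin{bmatrix} -\tfrac{4}{5} \\ \tfrac{1}{5} \end{bmatrix} $$

Continuing in the same manner, we obtain

$$ P_2 = P_1 + \beta_1 k_1 = \tfrac{5}{3} + \left(-\tfrac{1}{3}\right)\left(\tfrac{1}{5}\right) = \tfrac{8}{5} $$

$$ \beta_2 = \mathbf{r}_2^T \mathbf{J} \mathbf{a}_2 + r(3) = \begin{bmatrix} 2 & 1 \end{bmatrix} \begin{bmatrix} \tfrac{1}{5} \\ -\tfrac{4}{5} \end{bmatrix} + \tfrac{1}{2} = \tfrac{1}{10} $$

$$ k_2 = -\frac{\beta_2}{P_2} = -\frac{1}{16} $$

$$ \mathbf{a}_3 = \begin{bmatrix} -\tfrac{4}{5} \\ \tfrac{1}{5} \\ 0 \end{bmatrix} + \begin{bmatrix} \tfrac{1}{5} \\ -\tfrac{4}{5} \\ 1 \end{bmatrix} \left(-\tfrac{1}{16}\right) = \begin{bmatrix} -\tfrac{13}{16} \\ \tfrac{1}{4} \\ -\tfrac{1}{16} \end{bmatrix} $$

$$ P_3 = P_2 + \beta_2 k_2 = \tfrac{8}{5} + \tfrac{1}{10}\left(-\tfrac{1}{16}\right) = \tfrac{51}{32} $$

The algorithm of Levinson-Durbin, summarized in Table 7.2, requires $M^2$ operations and
is implemented by the function [a,k,Po]=durbin(r,M).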


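TABLE 7.2
Summary of the Levinson-Durbin algorithm.

1. Input: r(0), r(1), r(2), ..., r(M)

2. Initialization
(a) $P_0 = r(0)$, $\beta_0 = r^*(1)$
(b) $k_0 = -r^*(1)/r(0)$, $a_1^{(1)} = k_0$

3. For m = 1, 2, ..., M − 1
(a) $P_m = P_{m-1} + \beta_{m-1} k_{m-1}^*$
(b) $\mathbf{r}_m = [r(1) \;\; r(2) \;\; \cdots \;\; r(m)]^T$
(c) $\beta_m = \mathbf{a}_m^T \mathbf{J} \mathbf{r}_m^* + r^*(m+1)$
(d) $k_m = -\beta_m / P_m$
(e) $\mathbf{a}_{m+1} = \begin{bmatrix} \mathbf{a}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{J} \mathbf{a}_m^* \\ 1 \end{bmatrix} k_m$

4. $P_M = P_{M-1} + \beta_{M-1} k_{M-1}^*$

5. Output: $\mathbf{a}_M$, $\{k_m\}_0^{M-1}$, $\{P_m\}_1^M$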
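The steps of Table 7.2 can be sketched in a few lines of Python. This is a minimal illustration that assumes real-valued autocorrelations, so all conjugates drop out; it is not the `durbin` function mentioned in the text, and the name `levinson_durbin` and its return convention are assumptions made here. Exact rational arithmetic is used so that the result can be compared directly with the fractions of Example 7.4.2.

```python
from fractions import Fraction

def levinson_durbin(r, M):
    """Real-valued Levinson-Durbin recursion following Table 7.2.

    r = [r(0), ..., r(M)]. Returns (a, k, P) with a = a_M,
    k = [k_0, ..., k_{M-1}] and P = [P_0, ..., P_M].
    """
    r = [Fraction(v) for v in r]
    P = [r[0]]                       # P_0 = r(0)
    beta = r[1]                      # beta_0 = r(1)
    k = [-beta / P[0]]               # k_0 = -r(1)/r(0)
    a = [k[0]]                       # a_1 = [k_0]
    for m in range(1, M):
        P.append(P[-1] + beta * k[-1])                 # step 3(a)
        # step 3(c): beta_m = a_m^T J r_m + r(m+1)
        beta = sum(a[i] * r[m - i] for i in range(m)) + r[m + 1]
        km = -beta / P[-1]                             # step 3(d)
        # step 3(e): a_{m+1} = [a_m; 0] + [J a_m; 1] k_m
        a = [a[i] + km * a[m - 1 - i] for i in range(m)] + [km]
        k.append(km)
    P.append(P[-1] + beta * k[-1])                     # step 4
    return a, k, P

a, k, P = levinson_durbin([3, 2, 1, Fraction(1, 2)], 3)
print(a)   # third-order forward predictor
print(k)   # reflection coefficients k_0, k_1, k_2
print(P)   # prediction error powers P_0, ..., P_3
```

Each pass through the loop costs one dot product and one scalar-vector update of length m, which is where the quadratic operation count comes from.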


Algorithm of Levinson 


The next example illustrates the algorithm of Levinson that can be used to solve a
system of linear equations with a Hermitian Toeplitz matrix and arbitrary right-hand side
vector.


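EXAMPLE 7.4.3. Consider an optimum filter with input x(n) and desired response y(n). The
autocorrelation of the input signal is r(0) = 3, r(1) = 2, and r(2) = 1. The cross-correlation
between the desired response and input is $d_1 = 1$, $d_2 = 2$, and $d_3 = 5/2$; and the power of y(n)
is $P_y = 3$. Design a third-order optimum FIR filter, using the algorithm of Levinson.

Solution. We start initializing the algorithm by noticing that for m = 0 we have $r(0) a_1^{(1)} = -r(1)$,
which gives

$$ a_1^{(1)} = k_0 = -\frac{r(1)}{r(0)} = -\frac{2}{3} $$

and

$$ P_0 = r(0) = 3 \qquad \beta_0 = r(1) = 2 $$

$$ P_1 = P_0 + \beta_0 k_0 = 3 + 2\left(-\tfrac{2}{3}\right) = \tfrac{5}{3} $$

Next, we compute the Levinson recursion for the first-order optimum filter

$$ \beta_0^c = d_1 = 1 \qquad k_0^c = c_1^{(1)} = \frac{\beta_0^c}{P_0} = \frac{d_1}{r(0)} = \frac{1}{3} $$

$$ P_1^c = P_0^c - \beta_0^c k_0^c = 3 - 1\left(\tfrac{1}{3}\right) = \tfrac{8}{3} $$

Then we carry the Levinson recursion for m = 1 to obtain

$$ \beta_1 = \mathbf{r}_1^T \mathbf{J} \mathbf{a}_1 + r(2) = 2\left(-\tfrac{2}{3}\right) + 1 = -\tfrac{1}{3} $$

$$ k_1 = -\frac{\beta_1}{P_1} = \frac{1/3}{5/3} = \frac{1}{5} $$

$$ \mathbf{a}_2 = \begin{bmatrix} -\tfrac{2}{3} \\ 0 \end{bmatrix} + \begin{bmatrix} -\tfrac{2}{3} \\ 1 \end{bmatrix} \frac{1}{5} = \begin{bmatrix} -\tfrac{4}{5} \\ \tfrac{1}{5} \end{bmatrix} $$

$$ P_2 = P_1 + \beta_1 k_1 = \tfrac{5}{3} + \left(-\tfrac{1}{3}\right)\left(\tfrac{1}{5}\right) = \tfrac{8}{5} $$

for the optimum predictor, and

$$ \beta_1^c = \mathbf{a}_1^T \mathbf{J} \mathbf{d}_1 + d_2 = -\tfrac{2}{3}(1) + 2 = \tfrac{4}{3} $$

$$ k_1^c = \frac{\beta_1^c}{P_1} = \frac{4/3}{5/3} = \frac{4}{5} $$

$$ \mathbf{c}_2 = \begin{bmatrix} \tfrac{1}{3} \\ 0 \end{bmatrix} + \begin{bmatrix} -\tfrac{2}{3} \\ 1 \end{bmatrix} \frac{4}{5} = \begin{bmatrix} -\tfrac{1}{5} \\ \tfrac{4}{5} \end{bmatrix} $$

$$ P_2^c = P_1^c - \beta_1^c k_1^c = \tfrac{8}{3} - \tfrac{4}{3}\left(\tfrac{4}{5}\right) = \tfrac{8}{5} $$

for the optimum filter. The last recursion (m = 2) is carried out only for the optimum filter and
gives

$$ \beta_2^c = \mathbf{a}_2^T \mathbf{J} \mathbf{d}_2 + d_3 = \begin{bmatrix} -\tfrac{4}{5} & \tfrac{1}{5} \end{bmatrix} \begin{bmatrix} 2 \\ 1 \end{bmatrix} + \tfrac{5}{2} = \tfrac{11}{10} $$

$$ k_2^c = \frac{\beta_2^c}{P_2} = \frac{11/10}{8/5} = \frac{11}{16} $$

$$ \mathbf{c}_3 = \begin{bmatrix} \mathbf{c}_2 \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{J} \mathbf{a}_2 \\ 1 \end{bmatrix} k_2^c = \begin{bmatrix} -\tfrac{1}{5} \\ \tfrac{4}{5} \\ 0 \end{bmatrix} + \begin{bmatrix} \tfrac{1}{5} \\ -\tfrac{4}{5} \\ 1 \end{bmatrix} \frac{11}{16} = \begin{bmatrix} -\tfrac{1}{16} \\ \tfrac{1}{4} \\ \tfrac{11}{16} \end{bmatrix} $$

$$ P_3^c = P_2^c - \beta_2^c k_2^c = \tfrac{8}{5} - \tfrac{11}{10}\left(\tfrac{11}{16}\right) = \tfrac{27}{32} $$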


The algorithm of Levinson, summarized in Table 7.3, is implemented by the MATLAB
function [c,k,kc,Pc]=levins(R,d,Py,M) and requires $2M^2$ operations because it
involves two dot products and two scalar-vector multiplications. A parallel processing
implementation of the algorithm is not possible because the dot products involve additions
that cannot be executed simultaneously. Notice that adding $M = 2^q$ numbers using M/2
adders requires $q = \log_2 M$ steps. This bottleneck can be avoided by using the algorithm
of Schur (see Section 7.6).
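The two coupled recursions can be sketched as follows. This is a minimal Python illustration for real-valued data, not the `levins` function referred to above; the name `levinson`, its argument order, and the inputs r = [3, 2, 1], d = [1, 2, 5/2], P_y = 3 are assumptions chosen for illustration. In the real case $\mathbf{b}_m = \mathbf{J}\mathbf{a}_m$, so $\beta_m^c = \mathbf{a}_m^T \mathbf{J} \mathbf{d}_m + d_{m+1}$.

```python
from fractions import Fraction

def levinson(r, d, Py):
    """Solve R c = d for a real symmetric Toeplitz R with first column r.

    r = [r(0), ..., r(M-1)], d = [d_1, ..., d_M], Py = power of y(n).
    Returns (c, Pc) with c = c_M and Pc = [P_0^c, ..., P_M^c].
    """
    M = len(d)
    r = [Fraction(v) for v in r]
    d = [Fraction(v) for v in d]
    P = r[0]                 # prediction error power P_m (P_0 = r(0))
    a = []                   # forward predictor a_m (empty for m = 0)
    c = []                   # optimum filter c_m (empty for m = 0)
    Pc = [Fraction(Py)]      # filtering MMSE, P_0^c = P_y
    for m in range(M):
        # ladder part: beta_m^c = a_m^T J d_m + d_{m+1}, k_m^c = beta_m^c / P_m
        beta_c = sum(a[i] * d[m - 1 - i] for i in range(m)) + d[m]
        kc = beta_c / P
        c = [c[i] + kc * a[m - 1 - i] for i in range(m)] + [kc]
        Pc.append(Pc[-1] - beta_c * kc)
        if m + 1 < M:
            # predictor part: one Levinson-Durbin step, order m -> m + 1
            beta = sum(a[i] * r[m - i] for i in range(m)) + r[m + 1]
            k = -beta / P
            a = [a[i] + k * a[m - 1 - i] for i in range(m)] + [k]
            P = P + beta * k
    return c, Pc

c, Pc = levinson([3, 2, 1], [1, 2, Fraction(5, 2)], 3)
print(c)    # third-order optimum filter
print(Pc)   # MMSE for orders 0 through 3
```

The predictor update is skipped on the last pass because the final filter update needs only $\mathbf{a}_{M-1}$ and $P_{M-1}$.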


Minimum phase and autocorrelation extension 
Using (7.4.16), we can also express the recursion (7.4.21) as

$$ P_{m+1} = P_m (1 - |k_m|^2) = P_m - \frac{|\beta_m|^2}{P_m} \tag{7.4.23} $$
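TABLE 7.3
Summary of the algorithm of Levinson.

1. Input: $\{r(l)\}_0^M$, $\{d_m\}_1^M$, $P_y$

2. Initialization
(a) $P_0 = r(0)$, $\beta_0 = r^*(1)$, $P_0^c = P_y$
(b) $k_0 = -\beta_0 / P_0$, $a_1^{(1)} = k_0$
(c) $\beta_0^c = d_1$
(d) $k_0^c = \beta_0^c / P_0$, $c_1^{(1)} = k_0^c$
(e) $P_1^c = P_0^c - \beta_0^c k_0^{c*}$

3. For m = 1, 2, ..., M − 1
(a) $\mathbf{r}_m = [r(1) \;\; r(2) \;\; \cdots \;\; r(m)]^T$
(b) $\beta_m = \mathbf{a}_m^T \mathbf{J} \mathbf{r}_m^* + r^*(m+1)$
(c) $P_m = P_{m-1} + \beta_{m-1} k_{m-1}^*$
(d) $k_m = -\beta_m / P_m$
(e) $\mathbf{a}_{m+1} = \begin{bmatrix} \mathbf{a}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{J} \mathbf{a}_m^* \\ 1 \end{bmatrix} k_m$
(f) $\beta_m^c = -\mathbf{c}_m^T \mathbf{J} \mathbf{r}_m^* + d_{m+1}$
(g) $k_m^c = \beta_m^c / P_m$
(h) $\mathbf{c}_{m+1} = \begin{bmatrix} \mathbf{c}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{J} \mathbf{a}_m^* \\ 1 \end{bmatrix} k_m^c$
(i) $P_{m+1}^c = P_m^c - \beta_m^c k_m^{c*}$

4. Output: $\mathbf{a}_M$, $\mathbf{c}_M$, $\{k_m, k_m^c\}_0^{M-1}$, $\{P_m, P_m^c\}_0^M$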

which, since $P_m \ge 0$, implies that

$$ P_{m+1} \le P_m \tag{7.4.24} $$

and since the matrix $\mathbf{R}_m$ is positive definite, then $P_m > 0$ and (7.4.23) implies that

$$ |k_m| < 1 \tag{7.4.25} $$

for all $1 \le m < M$. If

$$ P_0 > \cdots > P_{M-1} > P_M = 0 \tag{7.4.26} $$

then the process x(n) is predictable and (7.4.23) implies that

$$ |k_{M-1}| = 1 \quad \text{and} \quad |k_m| < 1 \quad 0 \le m < M - 1 \tag{7.4.27} $$

(see Section 6.6.4). Also if

$$ P_{M-1} > P_M = \cdots = P_\infty \triangleq P > 0 \tag{7.4.28} $$

from (7.4.23) we have

$$ k_m = 0 \quad \text{for } m \ge M \tag{7.4.29} $$

which implies that the process x(n) is AR(M) and $e_M^f(n) \sim \mathrm{WN}(0, P_M)$ (see Section
4.2.3). Finally, we note that since the sequence $P_0, P_1, P_2, \ldots$ is nonincreasing, its limit
as $m \to \infty$ exists and is nonnegative. A regular process must satisfy $|k_m| < 1$ for all m,
because $|k_m| = 1$ implies that $P_{m+1} = 0$, which contradicts the regularity assumption.

For m = 0, (7.4.19) gives $P_0 = r(0)$. Carrying out (7.4.23) from m = 0 to m = M,
we obtain

$$ P_M = r(0) \prod_{m=1}^{M} (1 - |k_{m-1}|^2) \tag{7.4.30} $$

which converges, as $M \to \infty$, if $|k_m| < 1$.
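As a quick numerical check of (7.4.30), consider the autocorrelation values r(0) = 3, r(1) = 2, r(2) = 1, r(3) = 1/2 of Example 7.4.2, for which the Levinson-Durbin recursion gives the reflection coefficients $k_0 = -2/3$, $k_1 = 1/5$, and $k_2 = -1/16$. The product form then yields

$$ P_3 = 3 \left(1 - \tfrac{4}{9}\right) \left(1 - \tfrac{1}{25}\right) \left(1 - \tfrac{1}{256}\right) = 3 \cdot \tfrac{5}{9} \cdot \tfrac{24}{25} \cdot \tfrac{255}{256} = \tfrac{51}{32} $$

with the intermediate products $P_1 = 5/3$ and $P_2 = 8/5$, in agreement with the values obtained from the additive recursion (7.4.21).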
7.5 LATTICE STRUCTURES FOR OPTIMUM FIR FILTERS AND PREDICTORS

To compute the forward prediction error of an FLP of order m, we use the formula

$$ e_m^f(n) = x(n) + \mathbf{a}_m^H \mathbf{x}_m(n-1) = x(n) + \sum_{k=1}^{m} a_k^{(m)*} x(n-k) \tag{7.5.1} $$

Similarly, for the BLP we have

$$ e_m^b(n) = x(n-m) + \mathbf{b}_m^H \mathbf{x}_m(n) = x(n-m) + \sum_{k=0}^{m-1} b_k^{(m)*} x(n-k) \tag{7.5.2} $$
k=0 


Both filters can be implemented using the direct-form filter structure shown in Figure 7.4. 
Since a,, and b,, do not have the optimum nesting property, we cannot obtain order-recursive 
direct-form structures for the computation of the prediction errors. However, next we show 
that we can derive an order-recursive lattice-ladder structure for the implementation of 
optimum predictors and filters using the algorithm of Levinson. 


FIGURE 7.4
Direct-form structure for the computation of the mth-order forward and backward
prediction errors.


7.5.1 Lattice-Ladder Structures 


We note that the data vector for the (m + 1)st-order predictor can be partitioned in the 
following ways: 


Xm+i(n) = [x(n) x(n — 1) «+s x(n —m +1) xan—m]" 
= [x},(n) x(n — mJ" (7.5.3) 
=[x() xFqaa—1]? (7.5.4) 


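Using (7.5.1), (7.5.3), (7.4.15), and (7.5.2), we obtain

$$e_{m+1}^f(n) = x(n) + \left( \begin{bmatrix} \mathbf{a}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} k_m \right)^H \begin{bmatrix} \mathbf{x}_m(n-1) \\ x(n-1-m) \end{bmatrix}$$
$$= x(n) + \mathbf{a}_m^H \mathbf{x}_m(n-1) + k_m^* \left[ \mathbf{b}_m^H \mathbf{x}_m(n-1) + x(n-1-m) \right]$$

or

$$e_{m+1}^f(n) = e_m^f(n) + k_m^* e_m^b(n-1) \tag{7.5.5}$$

Using (7.4.11) and (7.4.15), we obtain the following Levinson-type recursion for the backward
predictor:

$$\mathbf{b}_{m+1} = \begin{bmatrix} 0 \\ \mathbf{b}_m \end{bmatrix} + \begin{bmatrix} 1 \\ \mathbf{a}_m \end{bmatrix} k_m^*$$

The backward prediction error is

$$e_{m+1}^b(n) = x(n-m-1) + \left( \begin{bmatrix} 0 \\ \mathbf{b}_m \end{bmatrix} + \begin{bmatrix} 1 \\ \mathbf{a}_m \end{bmatrix} k_m^* \right)^H \begin{bmatrix} x(n) \\ \mathbf{x}_m(n-1) \end{bmatrix}$$
$$= x(n-m-1) + \mathbf{b}_m^H \mathbf{x}_m(n-1) + k_m \left[ x(n) + \mathbf{a}_m^H \mathbf{x}_m(n-1) \right]$$

or

$$e_{m+1}^b(n) = e_m^b(n-1) + k_m e_m^f(n) \tag{7.5.6}$$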


Recursions (7.5.5) and (7.5.6) can be computed for m = 0, 1, ..., M − 1. The initial
conditions $e_0^f(n)$ and $e_0^b(n)$ are easily obtained from (7.5.1) and (7.5.2). The recursions also
lead to the following all-zero lattice algorithm

$$e_0^f(n) = e_0^b(n) = x(n)$$
$$e_m^f(n) = e_{m-1}^f(n) + k_{m-1}^* e_{m-1}^b(n-1) \qquad m = 1, 2, \ldots, M$$
$$e_m^b(n) = k_{m-1} e_{m-1}^f(n) + e_{m-1}^b(n-1) \qquad m = 1, 2, \ldots, M \tag{7.5.7}$$
$$e(n) = e_M^f(n)$$

that is implemented using the structure shown in Figure 7.5. The lattice parameters $k_m$ are
known as reflection coefficients in the speech processing and geophysics areas.
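The all-zero lattice recursions (7.5.7) can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function name `lattice_allzero` is hypothetical, and samples before n = 0 are assumed to be zero.

```python
from fractions import Fraction

def lattice_allzero(x, k):
    """Run the all-zero lattice (7.5.7) over the sequence x.

    k = [k_0, ..., k_{M-1}]. Returns (ef, eb): the stage-M forward and
    backward prediction error sequences. Initial delay contents are zero
    (an assumption of this sketch).
    """
    M = len(k)
    b_prev = [0] * M              # b_prev[m] holds e_m^b(n-1), m = 0..M-1
    ef_out, eb_out = [], []
    for xn in x:
        f = xn                    # e_0^f(n)
        b = xn                    # e_0^b(n)
        b_cur = []
        for m in range(M):
            b_cur.append(b)       # remember e_m^b(n) for the next time step
            f_next = f + k[m].conjugate() * b_prev[m]   # e_{m+1}^f(n)
            b_next = k[m] * f + b_prev[m]               # e_{m+1}^b(n)
            f, b = f_next, b_next
        b_prev = b_cur
        ef_out.append(f)
        eb_out.append(b)
    return ef_out, eb_out

# Impulse input: the outputs are the impulse responses of the forward and
# backward prediction error filters for k_0 = -2/3, k_1 = 1/5, k_2 = -1/16.
k = [Fraction(-2, 3), Fraction(1, 5), Fraction(-1, 16)]
ef, eb = lattice_allzero([1, 0, 0, 0], k)
```

For a real-valued predictor the backward impulse response is the time-reversed forward one, which is a convenient sanity check on any lattice implementation.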


FIGURE 7.5
All-zero lattice structure for the implementation of the forward and backward
prediction error filters.


The Levinson recursion for the optimum filter, (7.4.8) through (7.4.10), adds a ladder
part to the lattice structure for the forward and backward predictors. Using (7.4.8), (7.5.7),
and the partitioning in (7.5.3), we can express the filtering error of order m + 1 in terms of
$e_m(n)$ and $e_m^b(n)$ as follows

$$e_{m+1}(n) = y(n) - \mathbf{c}_{m+1}^H \mathbf{x}_{m+1}(n) = e_m(n) - k_m^{c*} e_m^b(n) \tag{7.5.8}$$

for m = 0, 1, ..., M − 1. The resulting lattice-ladder structure is similar to the one
shown in Figure 7.3. However, owing to stationarity all coefficients are constant, and
$k_m^f(n) = k_m^{b*}(n) = k_m$. We note that the efficient solution of the Mth-order optimum filtering
problem is derived from the solution of the (M − 1)st-order forward and backward prediction
problems of the input process. In fact, the lattice part serves to decorrelate the samples
$x(n), x(n-1), \ldots, x(n-M)$, producing the uncorrelated samples $e_0^b(n), e_1^b(n), \ldots, e_M^b(n)$
(innovations), which are then linearly combined ("recorrelated") to obtain the optimum
estimate of the desired response.


System functions. We next express the various lattice relations in terms of z-transforms.
Taking the z-transform of (7.5.1) and (7.5.2), we obtain

$$E_m^f(z) = \left[ 1 + \sum_{k=1}^{m} a_k^{(m)*} z^{-k} \right] X(z) \triangleq A_m(z) X(z) \tag{7.5.9}$$

$$E_m^b(z) = \left[ z^{-m} + \sum_{k=0}^{m-1} b_k^{(m)*} z^{-k} \right] X(z) \triangleq B_m(z) X(z) \tag{7.5.10}$$

where $A_m(z)$ and $B_m(z)$ are the system functions of the paths from the input to the outputs
of the mth stage of the lattice. Using the symmetry relation $\mathbf{a}_m = \mathbf{J}\mathbf{b}_m^*$, $1 \le m \le M$, we
obtain

$$B_m(z) = z^{-m} A_m^*\!\left( \frac{1}{z^*} \right) \tag{7.5.11}$$

Note that if $z_0$ is a zero of $A_m(z)$, then $1/z_0^*$ (that is, $z_0^{-1}$ for real coefficients) is a zero of
$B_m(z)$. Therefore, if $A_m(z)$ is minimum-phase, then $B_m(z)$ is maximum-phase.

Taking the z-transform of the lattice equations (7.5.7), we have for the mth stage

$$E_m^f(z) = E_{m-1}^f(z) + k_{m-1}^* z^{-1} E_{m-1}^b(z) \tag{7.5.12}$$

$$E_m^b(z) = k_{m-1} E_{m-1}^f(z) + z^{-1} E_{m-1}^b(z) \tag{7.5.13}$$

Dividing both equations by X(z) and using (7.5.9) and (7.5.10), we have

$$A_m(z) = A_{m-1}(z) + k_{m-1}^* z^{-1} B_{m-1}(z) \tag{7.5.14}$$

$$B_m(z) = k_{m-1} A_{m-1}(z) + z^{-1} B_{m-1}(z) \tag{7.5.15}$$

which, when initialized with

$$A_0(z) = B_0(z) = 1 \tag{7.5.16}$$

describe the lattice filter in the z domain.
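The z-domain recursions (7.5.14) through (7.5.16) translate directly into operations on coefficient lists. The sketch below is illustrative only; the function name `lattice_system_functions` is hypothetical, and polynomials are stored as lists of coefficients in increasing powers of $z^{-1}$.

```python
from fractions import Fraction

def lattice_system_functions(k):
    """Build the coefficient lists of A_M(z) and B_M(z) from k = [k_0..k_{M-1}].

    Each list holds the coefficients of z^0, z^-1, ..., z^-M, following
    (7.5.14)-(7.5.16).
    """
    A = [1]                       # A_0(z) = 1
    B = [1]                       # B_0(z) = 1
    for km in k:
        zB = [0] + B              # z^{-1} B_{m-1}(z)
        Ap = A + [0]              # A_{m-1}(z), zero-padded to the new length
        A_new = [a + km.conjugate() * b for a, b in zip(Ap, zB)]   # (7.5.14)
        B_new = [km * a + b for a, b in zip(Ap, zB)]               # (7.5.15)
        A, B = A_new, B_new
    return A, B

k = [Fraction(-2, 3), Fraction(1, 5), Fraction(-1, 16)]
A3, B3 = lattice_system_functions(k)
```

For real coefficients, (7.5.11) predicts that the list for $B_m(z)$ is the reverse of the list for $A_m(z)$, which the example values satisfy.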
The z-transform of the ladder part (7.5.8) is given by

$$E_{m+1}(z) = E_m(z) - k_m^{c*} E_m^b(z) \tag{7.5.17}$$

where $E_m(z)$ is the z-transform of the error sequence $e_m(n)$.


All-pole or "inverse" lattice structure. If we wish to recover the input x(n) from the
prediction error $e(n) = e_M^f(n)$, we can use the following all-pole lattice filter algorithm

$$e_M^f(n) = e(n)$$
$$e_{m-1}^f(n) = e_m^f(n) - k_{m-1}^* e_{m-1}^b(n-1) \qquad m = M, M-1, \ldots, 1$$
$$e_m^b(n) = e_{m-1}^b(n-1) + k_{m-1} e_{m-1}^f(n) \qquad m = M, M-1, \ldots, 1 \tag{7.5.18}$$
$$x(n) = e_0^f(n) = e_0^b(n)$$

which is derived as explained in Section 2.5 and is implemented by using the structure in
Figure 7.6. Although the system functions of the all-zero lattice in (7.5.7) and the all-pole
lattice in (7.5.18) are $H_{az}(z) = A(z)$ and $H_{ap}(z) = 1/A(z)$, the two lattice structures are
described by the same set of lattice coefficients. The difference is the signal flow (see the
feedback loops in the all-pole structure). This structure is used in speech processing applications
(Rabiner and Schafer 1978).
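A minimal Python sketch of the all-pole lattice (7.5.18) follows. The name `lattice_allpole` is hypothetical, and zero initial conditions are assumed. Driving it with an impulse yields the impulse response of 1/A(z).

```python
from fractions import Fraction

def lattice_allpole(e, k):
    """Recover x(n) from e(n) = e_M^f(n) with the all-pole lattice (7.5.18).

    k = [k_0, ..., k_{M-1}]; the delay elements start at zero
    (an assumption of this sketch).
    """
    M = len(k)
    b_prev = [0] * M                  # b_prev[m] holds e_m^b(n-1), m = 0..M-1
    x = []
    for en in e:
        f = en                        # e_M^f(n)
        b_cur = [0] * M               # will hold e_m^b(n), m = 0..M-1
        for m in range(M, 0, -1):
            # e_{m-1}^f(n) = e_m^f(n) - k_{m-1}^* e_{m-1}^b(n-1)
            f = f - k[m - 1].conjugate() * b_prev[m - 1]
            # e_m^b(n) = e_{m-1}^b(n-1) + k_{m-1} e_{m-1}^f(n);
            # the stage-M backward output is never fed back, so it is skipped
            if m < M:
                b_cur[m] = b_prev[m - 1] + k[m - 1] * f
        if M > 0:
            b_cur[0] = f              # e_0^b(n) = e_0^f(n) = x(n)
        b_prev = b_cur
        x.append(f)
    return x

k = [Fraction(-2, 3), Fraction(1, 5), Fraction(-1, 16)]
h = lattice_allpole([1, 0, 0, 0, 0, 0], k)   # impulse response of 1/A(z)
```

The same list `k` that defines the all-zero lattice is reused unchanged; only the direction of the forward path differs, which is the point made in the paragraph above.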


7.5.2 Some Properties and Interpretations 


Lattice filters have some important properties and interesting interpretations that make them 
a useful tool in optimum filtering and signal modeling. 


Optimal nesting. The all-zero lattice filter has an optimal nesting property when it is
used for the implementation of an FLP. Indeed, if we use the lattice parameters obtained
via the algorithm of Levinson-Durbin, the all-zero lattice filter driven by the signal x(n)
produces prediction errors $e_m^f(n)$ and $e_m^b(n)$ at the output of the mth stage for all $1 \le m \le M$.
This implies that we can increase the order of the filter by attaching additional stages without
destroying the optimality of the previous stages. In contrast, the direct-form filter structure
implementation requires the computation of the entire predictor for each stage. However,
the nesting property does not hold for the all-pole lattice filter because of the feedback
path.


FIGURE 7.6
All-pole lattice structure for recovering the input signal from the forward prediction error.


Orthogonality. The backward prediction errors $e_m^b(n)$ for $0 \le m \le M$ are uncorrelated
(see Section 7.2), that is,

$$E\{e_m^b(n)\, e_k^{b*}(n)\} = \begin{cases} P_m & k = m \\ 0 & k \ne m \end{cases} \tag{7.5.19}$$

and constitute the innovations representation of the input samples $x(n), x(n-1), \ldots,
x(n-M)$. We see that at a given time instant n, the backward prediction errors for orders
m = 0, 1, 2, ..., M are uncorrelated and are part of a nonstationary sequence because the
variance $E\{|e_m^b(n)|^2\} = P_m$ depends on m. This should be expected because, for a given
n, each $e_m^b(n)$ is computed using a different set of predictor coefficients. In contrast, for a
given m, the sequence $e_m^b(n)$ is stationary for $-\infty < n < \infty$.


Reflection coefficients. The all-pole lattice structure is very useful in the modeling of 
layered media, where each stage of the lattice models one layer or section of the medium. 
Traveling waves in geophysical layers, in acoustic tubes of varying cross-sections, and 
in multisectional transmission lines have been modeled in this fashion. The modeling is 
performed such that the wave travel time through each section is the same, but the sections 
may have different impedances. The mth section is modeled with the signals $e_m^f(n)$ and
$e_m^b(n)$ representing the forward and backward traveling waves, respectively.

If $Z_m$ and $Z_{m-1}$ are the characteristic impedances at sections m and m − 1, respectively,
then $k_m$ represents the reflection coefficient between the two sections, given by

$$k_m = \frac{Z_m - Z_{m-1}}{Z_m + Z_{m-1}} \tag{7.5.20}$$

For this reason, the lattice parameters $k_m$ are often known as reflection coefficients. As
reflection coefficients, it makes good sense that their magnitudes not exceed unity. The
termination of the lattice assumes a perfect reflection, and so the reflected wave $e_0^b(n)$ is
equal to the transmitted wave $e_0^f(n)$. The result of this specific termination is an overall
all-pole model (Rabiner and Schafer 1978).


Partial correlation coefficients. The partial correlation coefficient (PCC) between
x(n) and x(n − m − 1) (see also Section 7.2.2) is defined as the correlation coefficient
between $e_m^f(n)$ and $e_m^b(n-1)$, that is,

$$\mathrm{PCC}\{x(n-m-1); x(n)\} \triangleq \frac{E\{e_m^b(n-1)\, e_m^{f*}(n)\}}{\sqrt{E\{|e_m^b(n-1)|^2\}\, E\{|e_m^f(n)|^2\}}} \tag{7.5.21}$$

and, therefore, it takes values in the range [−1, 1] (Kendall and Stuart 1979).
Working as in Section 7.2, we can show that

$$E\{e_m^b(n-1)\, e_m^{f*}(n)\} = \mathbf{b}_m^H \mathbf{r}_m^* + r^*(m+1) = \beta_m \tag{7.5.22}$$

which in conjunction with

$$E\{|e_m^b(n-1)|^2\} = E\{|e_m^f(n)|^2\} = P_m \tag{7.5.23}$$

and (7.4.16), results in

$$k_m = -\frac{\beta_m}{P_m} = -\mathrm{PCC}\{x(n-m-1); x(n)\} \tag{7.5.24}$$

That is, for stationary processes the lattice parameters are the negative of the partial
autocorrelation sequence and satisfy the relation

$$|k_m| \le 1 \qquad \text{for all } 0 \le m \le M-1 \tag{7.5.25}$$

derived also for (7.4.25) using an alternate approach.


Minimum phase. According to Theorem 2.3 (Section 2.5), the roots of the polynomial
A(z) are inside the unit circle if and only if

$$|k_m| < 1 \qquad \text{for all } 0 \le m \le M-1 \tag{7.5.26}$$

which implies that the filters with system functions A(z) and 1/A(z) are minimum-phase.
The strict inequalities (7.5.26) are satisfied if the stationary process x(n) is nonpredictable,
which is the case when the Toeplitz autocorrelation matrix R is positive definite.


Lattice-ladder optimization. As we saw in Section 2.5, the output of an FIR lattice 
filter is a nonlinear function of the lattice parameters. Hence, if we try to design an optimum 
lattice filter by minimizing the MSE with respect to the lattice parameters, we end up with 
a nonlinear optimization problem (see Problem 7.11). In contrast, the Levinson algorithm 
leads to a lattice-ladder realization of the optimum filter through the order-recursive solution 
of a linear optimization problem. This subject is of interest to signal modeling and adaptive 
filtering (see Chapters 9 and 10). 


7.5.3 Parameter Conversions 
We have shown that the Mth-order forward linear predictor of a stationary process x(n)
is uniquely specified by a set of linear equations in terms of the autocorrelation sequence
and the prediction error filter is minimum-phase. Furthermore, it can be implemented using
either a direct-form structure with coefficients $a_1^{(M)}, a_2^{(M)}, \ldots, a_M^{(M)}$ or a lattice structure
with parameters $k_0, k_1, \ldots, k_{M-1}$. Next we show how to convert between the following
equivalent representations of a linear predictor:

1. Direct-form filter structure: $\{P_M, a_1, a_2, \ldots, a_M\}$.

2. Lattice filter structure: $\{P_M, k_0, k_1, \ldots, k_{M-1}\}$.

3. Autocorrelation sequence: $\{r(0), r(1), \ldots, r(M)\}$.


The transformation between the above representations is performed using the algorithms
shown in Figure 7.7.


FIGURE 7.7
Equivalent representations for minimum-phase linear prediction error filters.


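Lattice-to-direct (step-up) recursion. Given the lattice parameters $k_0, k_1, \ldots, k_{M-1}$ and
the MMSE error $P_M$, we can compute the forward predictor $\mathbf{a}_M$ by using the following
recursions

$$\mathbf{a}_m = \begin{bmatrix} \mathbf{a}_{m-1} \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{J}\mathbf{a}_{m-1}^* \\ 1 \end{bmatrix} k_{m-1} \tag{7.5.27}$$

$$P_m = P_{m-1} \left( 1 - |k_{m-1}|^2 \right) \tag{7.5.28}$$

for m = 1, 2, ..., M. This conversion is implemented by the function [a,PM]=stepup(k).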
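The step-up recursion (7.5.27) and (7.5.28) can be sketched in Python as follows. This is not the book's `stepup` function: the name `step_up` and its signature are hypothetical, and the sketch takes $P_0 = r(0)$ as an explicit input (an assumption) so that the MMSE can be propagated forward.

```python
from fractions import Fraction

def step_up(k, P0):
    """Lattice-to-direct conversion via (7.5.27) and (7.5.28).

    k  = [k_0, ..., k_{M-1}]
    P0 = zeroth-order error r(0) (assumed input of this sketch)
    Returns (a, PM) with a = [a_1^(M), ..., a_M^(M)].
    """
    a = []                                            # order-0 predictor is empty
    P = P0
    for km in k:                                      # km is k_{m-1}, m = 1..M
        flipped = [c.conjugate() for c in reversed(a)]            # J a_{m-1}^*
        a = [ai + fi * km for ai, fi in zip(a, flipped)] + [km]   # (7.5.27)
        P = P * (1 - abs(km) ** 2)                                # (7.5.28)
    return a, P

a3, P3 = step_up([Fraction(-2, 3), Fraction(1, 5), Fraction(-1, 16)], Fraction(3))
print(a3, P3)
```

With these inputs the predictor comes out as [−13/16, 1/4, −1/16], which can be checked against the normal equations for the autocorrelation values 3, 2, 1, 1/2.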


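Direct-to-lattice (step-down) recursion. Using the partitioning

$$\mathbf{a}_m = [a_1^{(m)} \;\; a_2^{(m)} \;\; \cdots \;\; a_m^{(m)}]^T \triangleq [\bar{\mathbf{a}}_m^T \;\; a_m^{(m)}]^T \qquad k_{m-1} = a_m^{(m)} \tag{7.5.29}$$

we can write recursion (7.5.27) as

$$\bar{\mathbf{a}}_m = \mathbf{a}_{m-1} + \mathbf{J}\mathbf{a}_{m-1}^* k_{m-1}$$

or by taking the complex conjugate and multiplying both sides by J

$$\mathbf{J}\bar{\mathbf{a}}_m^* = \mathbf{J}\mathbf{a}_{m-1}^* + \mathbf{a}_{m-1} k_{m-1}^*$$

Eliminating $\mathbf{J}\mathbf{a}_{m-1}^*$ from the last two equations and solving for $\mathbf{a}_{m-1}$, we obtain

$$\mathbf{a}_{m-1} = \frac{\bar{\mathbf{a}}_m - \mathbf{J}\bar{\mathbf{a}}_m^* k_{m-1}}{1 - |k_{m-1}|^2} \tag{7.5.30}$$

From (7.5.28), we have

$$P_{m-1} = \frac{P_m}{1 - |k_{m-1}|^2} \tag{7.5.31}$$

Given $\mathbf{a}_M$ and $P_M$, we can obtain $k_m$ and $P_m$ for $0 \le m \le M-1$ by computing the last two
recursions for m = M, M − 1, ..., 2. We should stress that both recursions break down if
$|k_m| = 1$. The step-down algorithm is implemented by the function [k]=stepdown(a).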


EXAMPLE 7.5.1. Given the third-order FLP coefficients $a_1^{(3)}, a_2^{(3)}, a_3^{(3)}$, compute the lattice
parameters $k_0, k_1, k_2$.

Solution. With the help of (7.5.29) the vector relation (7.5.30) can be written in scalar form as

$$k_{m-1} = a_m^{(m)} \tag{7.5.32}$$

and

$$a_i^{(m-1)} = \frac{a_i^{(m)} - a_{m-i}^{(m)*} k_{m-1}}{1 - |k_{m-1}|^2} \tag{7.5.33}$$

which can be used to implement the step-down algorithm for m = M, M − 1, ..., 2 and i =
1, 2, ..., m − 1. Starting with m = 3 and i = 1, 2, we have

$$k_2 = a_3^{(3)} \qquad a_1^{(2)} = \frac{a_1^{(3)} - a_2^{(3)*} k_2}{1 - |k_2|^2} \qquad a_2^{(2)} = \frac{a_2^{(3)} - a_1^{(3)*} k_2}{1 - |k_2|^2}$$

Similarly, for m = 2 and i = 1, we obtain

$$k_1 = a_2^{(2)} \qquad a_1^{(1)} = \frac{a_1^{(2)} - a_1^{(2)*} k_1}{1 - |k_1|^2} = k_0$$

which completes the solution.
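The scalar recursions (7.5.32) and (7.5.33), together with (7.5.31), can be sketched as below. The function `step_down` is a hypothetical stand-in, not the book's `stepdown`; unlike the text, this sketch runs the loop down to m = 1 so that $k_0$ and $P_0$ come out of the same loop.

```python
from fractions import Fraction

def step_down(a, PM):
    """Direct-to-lattice conversion via (7.5.31)-(7.5.33).

    a  = [a_1^(M), ..., a_M^(M)], PM = Mth-order MMSE.
    Returns (k, P0) with k = [k_0, ..., k_{M-1}].
    """
    a = list(a)
    P = PM
    k = []
    for m in range(len(a), 0, -1):
        km = a[-1]                                   # k_{m-1} = a_m^(m), (7.5.32)
        denom = 1 - abs(km) ** 2
        if denom == 0:
            raise ValueError("recursion breaks down when |k| = 1")
        head = a[:-1]                                # a_1^(m) .. a_{m-1}^(m)
        flipped = [c.conjugate() for c in reversed(head)]   # a_{m-i}^(m)*
        a = [(hi - fi * km) / denom for hi, fi in zip(head, flipped)]  # (7.5.33)
        P = P / denom                                # (7.5.31)
        k.append(km)
    k.reverse()
    return k, P

k, P0 = step_down([Fraction(-13, 16), Fraction(1, 4), Fraction(-1, 16)], Fraction(51, 32))
print(k, P0)
```

The explicit guard on $|k| = 1$ mirrors the warning in the text that both recursions break down in that case.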


The step-up and step-down recursions also can be expressed in polynomial form as

$$A_m(z) = A_{m-1}(z) + k_{m-1}^* z^{-m} A_{m-1}^*\!\left( \frac{1}{z^*} \right) \tag{7.5.34}$$

and

$$A_{m-1}(z) = \frac{A_m(z) - k_{m-1}^* z^{-m} A_m^*(1/z^*)}{1 - |k_{m-1}|^2} \tag{7.5.35}$$

respectively.


Lattice parameters to autocorrelation. If we know the lattice parameters $k_0, k_1, \ldots,
k_{M-1}$ and $P_M$, we can compute the values r(0), r(1), ..., r(M) of the autocorrelation
sequence using the formula

$$r(m+1) = -k_m^* P_m - \mathbf{a}_m^H \mathbf{J} \mathbf{r}_m \tag{7.5.36}$$

which follows from (7.4.16) and (7.4.17), in conjunction with (7.5.27) and (7.4.21) for
m = 1, 2, ..., M. Equation (7.5.36) is obtained by eliminating $\beta_m$ from (7.4.9) and (7.4.10).
This algorithm is used by the function r=k2r(k,PM). Another algorithm that computes the
autocorrelation sequence from the lattice coefficients and does not require the intermediate
computation of $\mathbf{a}_m$ is provided in Section 7.6.


EXAMPLE 7.5.2. Given $P_0$, $k_0$, $k_1$, and $k_2$, compute the autocorrelation values r(0), r(1), r(2),
and r(3).

Solution. Using $r(0) = P_0$ and

$$r(m+1) = -k_m^* P_m - \mathbf{a}_m^H \mathbf{J} \mathbf{r}_m$$

for m = 0, we have

$$r(1) = -k_0^* P_0$$

For m = 1

$$r(2) = -k_1^* P_1 - a_1^{(1)*} r(1)$$

where

$$P_1 = P_0 (1 - |k_0|^2)$$

Finally, for m = 2 we obtain

$$r(3) = -k_2^* P_2 - [a_1^{(2)*} r(2) + k_1^* r(1)]$$

where

$$P_2 = P_1 (1 - |k_1|^2)$$

and

$$a_1^{(2)} = a_1^{(1)} + a_1^{(1)*} k_1 = k_0 + k_0^* k_1$$

from the Levinson recursion.
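The procedure of Example 7.5.2 generalizes to any order by interleaving (7.5.36) with the step-up recursion. The Python sketch below is one possible arrangement; the function name `lattice_to_autocorr` is hypothetical, and $P_0$ is first recovered from $P_M$ by undoing (7.5.28), which is an implementation choice of this sketch.

```python
from fractions import Fraction

def lattice_to_autocorr(k, PM):
    """Compute r(0..M) from k = [k_0..k_{M-1}] and P_M using (7.5.36)."""
    P = PM
    for km in k:                      # undo (7.5.28) to obtain P_0 = r(0)
        P = P / (1 - abs(km) ** 2)
    r = [P]
    a = []                            # current predictor a_m (empty for m = 0)
    for m, km in enumerate(k):
        # a_m^H J r_m = sum_{i=1}^{m} a_i^(m)* r(m + 1 - i)
        acc = sum(ai.conjugate() * r[m - i] for i, ai in enumerate(a))
        r.append(-km.conjugate() * P - acc)           # (7.5.36)
        flipped = [c.conjugate() for c in reversed(a)]
        a = [ai + fi * km for ai, fi in zip(a, flipped)] + [km]   # (7.5.27)
        P = P * (1 - abs(km) ** 2)                                # (7.5.28)
    return r

r = lattice_to_autocorr([Fraction(-2, 3), Fraction(1, 5), Fraction(-1, 16)], Fraction(51, 32))
print(r)
```

For the lattice parameters and MMSE of Example 7.6.3, this returns the autocorrelation values 3, 2, 1, 1/2.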


Direct parameters to autocorrelation. Given $\mathbf{a}_M$ and $P_M$, we can compute the
autocorrelation sequence r(0), r(1), ..., r(M) by using (7.5.29) through (7.5.36). This method
is known as the inverse Levinson algorithm and is implemented by the function r=a2r(a,PM).
7.6 ALGORITHM OF SCHUR


The algorithm of Schur is an order-recursive procedure for the computation of the lattice
parameters $k_0, k_1, \ldots, k_{M-1}$ of the optimum forward predictor from the autocorrelation
sequence r(0), r(1), ..., r(M) without computing the direct-form coefficients $\mathbf{a}_m$, m =
1, 2, ..., M. The reverse process is known as the inverse Schur algorithm. The algorithm
also can be extended to compute the ladder parameters of the optimum filter and the $LDL^H$
decomposition of a Toeplitz matrix. The algorithm has its roots in the original work of
Schur (Schur 1917), who developed a procedure to test whether a polynomial is analytic
and bounded in the unit disk.


7.6.1 Direct Schur Algorithm


We start by defining the cross-correlation sequences between $e_m^f(n)$, $e_m^b(n)$, and x(n)

$$\xi_m^f(l) \triangleq E\{x(n-l)\, e_m^{f*}(n)\} \qquad \text{with } \xi_m^f(l) = 0, \text{ for } 1 \le l \le m \tag{7.6.1}$$

$$\xi_m^b(l) \triangleq E\{x(n-l)\, e_m^{b*}(n)\} \qquad \text{with } \xi_m^b(l) = 0, \text{ for } 0 \le l < m \tag{7.6.2}$$

which are also known as gapped functions because of the regions of zeros created by the
orthogonality principle (Robinson and Treitel 1980).

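Multiplying the direct-form equations (7.5.1) and (7.5.2) by $x^*(n-l)$ and taking the
mathematical expectation of both sides, we obtain

$$\xi_m^f(l) = r(l) + \mathbf{a}_m^H \tilde{\mathbf{r}}_m(l-1) \tag{7.6.3}$$

and

$$\xi_m^b(l) = r(l-m) + \mathbf{b}_m^H \tilde{\mathbf{r}}_m(l) \tag{7.6.4}$$

where

$$\tilde{\mathbf{r}}_m(l) \triangleq [r(l) \;\; r(l-1) \;\; \cdots \;\; r(l-m+1)]^T \tag{7.6.5}$$

We notice that $\xi_m^f(l)$ and $\xi_m^b(l)$ can be interpreted as forward and backward autocorrelation
prediction errors, because they occur when we feed the sequence r(0), r(1), ..., r(m + 1)
through the optimum predictors $\mathbf{a}_m$ and $\mathbf{b}_m$ of the process x(n). Using the property $\mathbf{b}_m =
\mathbf{J}\mathbf{a}_m^*$, we can show that (see Problem 7.29)

$$\xi_m^b(l) = \xi_m^{f*}(m-l) \tag{7.6.6}$$

If we set l = m + 1 in (7.6.3) and l = m in (7.6.4), and notice that $\tilde{\mathbf{r}}_m(m) = \mathbf{J}\mathbf{r}_m$, then we
have

$$\xi_m^f(m+1) = r(m+1) + \mathbf{a}_m^H \mathbf{J} \mathbf{r}_m = \beta_m^* \tag{7.6.7}$$

and

$$\xi_m^b(m) = r(0) + \mathbf{b}_m^H \mathbf{J} \mathbf{r}_m = P_m \tag{7.6.8}$$

respectively. Therefore, we have

$$k_m = -\frac{\beta_m}{P_m} = -\frac{\xi_m^f(m+1)}{\xi_m^b(m)} \tag{7.6.9}$$

that is, we can compute $k_m$ in terms of $\xi_m^f(l)$ and $\xi_m^b(l)$.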
Multiplying the lattice recursions (7.5.7) by $x^*(n-l)$ and taking the mathematical
expectation of both sides, we obtain

$$\xi_0^f(l) = \xi_0^b(l) = r(l)$$
$$\xi_m^f(l) = \xi_{m-1}^f(l) + k_{m-1}^* \xi_{m-1}^b(l-1) \qquad m = 1, 2, \ldots, M \tag{7.6.10}$$
$$\xi_m^b(l) = k_{m-1} \xi_{m-1}^f(l) + \xi_{m-1}^b(l-1) \qquad m = 1, 2, \ldots, M$$

which provides a lattice structure for the computation of the cross-correlations $\xi_m^f(l)$ and
$\xi_m^b(l)$. In contrast, (7.6.7) and (7.6.8) provide a computation using a direct-form structure.

In the next example we illustrate how to use the lattice structure (7.6.10) to compute the
lattice parameters $k_0, k_1, \ldots, k_{M-1}$ from the autocorrelation sequence r(0), r(1), ..., r(M)
without the intermediate explicit computation of the predictor coefficients $\mathbf{a}_m$.


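EXAMPLE 7.6.1. Use the algorithm of Schur to compute the lattice parameters $\{k_0, k_1, k_2\}$ and
the MMSE $P_3$ from the autocorrelation sequence coefficients

$$r(0) = 3 \qquad r(1) = 2 \qquad r(2) = 1 \qquad r(3) = \tfrac{1}{2}$$

Solution. Starting with (7.6.9) for m = 0, we have

$$k_0 = -\frac{\xi_0^f(1)}{\xi_0^b(0)} = -\frac{r(1)}{r(0)} = -\frac{2}{3}$$

because $\xi_0^f(l) = \xi_0^b(l) = r(l)$. To compute $k_1$, we need $\xi_1^f(2)$ and $\xi_1^b(1)$, which are obtained
from (7.6.10) by setting l = 2 and l = 1, respectively. Indeed, we have

$$\xi_1^f(2) = \xi_0^f(2) + k_0 \xi_0^b(1) = 1 + \left(-\tfrac{2}{3}\right) 2 = -\tfrac{1}{3}$$
$$\xi_1^b(1) = \xi_0^b(0) + k_0 \xi_0^f(1) = 3 + \left(-\tfrac{2}{3}\right) 2 = \tfrac{5}{3} = P_1$$

and

$$k_1 = -\frac{\xi_1^f(2)}{\xi_1^b(1)} = -\frac{-1/3}{5/3} = \frac{1}{5}$$

The computation of $k_2$ requires $\xi_2^f(3)$ and $\xi_2^b(2)$, which in turn need $\xi_1^f(3)$ and $\xi_1^b(2)$. These
quantities are computed by

$$\xi_1^f(3) = \xi_0^f(3) + k_0 \xi_0^b(2) = \tfrac{1}{2} + \left(-\tfrac{2}{3}\right) 1 = -\tfrac{1}{6}$$
$$\xi_1^b(2) = \xi_0^b(1) + k_0 \xi_0^f(2) = 2 + \left(-\tfrac{2}{3}\right) 1 = \tfrac{4}{3}$$
$$\xi_2^f(3) = \xi_1^f(3) + k_1 \xi_1^b(2) = -\tfrac{1}{6} + \tfrac{1}{5} \cdot \tfrac{4}{3} = \tfrac{1}{10}$$
$$\xi_2^b(2) = \xi_1^b(1) + k_1 \xi_1^f(2) = \tfrac{5}{3} + \tfrac{1}{5} \left(-\tfrac{1}{3}\right) = \tfrac{8}{5}$$

and the lattice coefficient is

$$k_2 = -\frac{\xi_2^f(3)}{\xi_2^b(2)} = -\frac{1/10}{8/5} = -\frac{1}{16}$$

The final MMSE is computed by

$$P_3 = P_2 (1 - |k_2|^2) = \tfrac{8}{5} \left(1 - \tfrac{1}{256}\right) = \tfrac{51}{32}$$

although we could use the formula $\xi_m^b(m) = P_m$ as well. Therefore the lattice coefficients and
the MMSE are found to be

$$k_0 = -\tfrac{2}{3} \qquad k_1 = \tfrac{1}{5} \qquad k_2 = -\tfrac{1}{16} \qquad P_3 = \tfrac{51}{32}$$

It is worthwhile to notice that the $k_m$ parameters can be obtained by "feeding" the
sequence r(0), r(1), ..., r(M) through the lattice filter as a signal and switching on the stages
one by one after computing the required lattice coefficient. The value of $k_m$ is computed at
time n = m from the inputs to stage m (see Problem 7.30).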

The procedure outlined in the above example is known as the algorithm of Schur and
has good numerical properties because the quantities used in the lattice structure (7.6.10)
are bounded. Indeed, from (7.6.1) and (7.6.2) we have

$$|\xi_m^f(l)|^2 \le E\{|e_m^f(n)|^2\}\, E\{|x(n-l)|^2\} \le P_m r(0) \le r^2(0) \tag{7.6.11}$$

$$|\xi_m^b(l)|^2 \le E\{|e_m^b(n)|^2\}\, E\{|x(n-l)|^2\} \le P_m r(0) \le r^2(0) \tag{7.6.12}$$

because $P_m \le P_0 = r(0)$. As a result of this fixed dynamic range, the algorithm of Schur
can be easily implemented with fixed-point arithmetic. The numeric stability of the Schur
algorithm provided the motivation for its use in speech processing applications (LeRoux
and Gueguen 1977).


7.6.2 Implementation Considerations 


Figure 7.8 clarifies the computational steps in Example 7.4.2, using three decomposition
trees that indicate the quantities needed to compute $k_0$, $k_1$, and $k_2$ when we use the lattice
recursions (7.6.10) for real-valued signals. We can easily see that the computations for $k_0$
are part of those for $k_1$, which in turn are part of the computations for $k_2$. Thus, the tree for
$k_2$ includes also the quantities needed to compute $k_0$ and $k_1$. The computations required to
compute $k_0$, $k_1$, $k_2$, and $k_3$ are

1. $k_0 = -\xi_0^f(1) / \xi_0^b(0)$
2. $\xi_1^f(4) = \xi_0^f(4) + k_0 \xi_0^b(3)$
3. $\xi_1^b(3) = \xi_0^b(2) + k_0 \xi_0^f(3)$
4. $\xi_1^f(3) = \xi_0^f(3) + k_0 \xi_0^b(2)$
5. $\xi_1^b(2) = \xi_0^b(1) + k_0 \xi_0^f(2)$
6. $\xi_1^f(2) = \xi_0^f(2) + k_0 \xi_0^b(1)$
7. $\xi_1^b(1) = \xi_0^b(0) + k_0 \xi_0^f(1)$
8. $k_1 = -\xi_1^f(2) / \xi_1^b(1)$
9. $\xi_2^f(4) = \xi_1^f(4) + k_1 \xi_1^b(3)$
10. $\xi_2^b(3) = \xi_1^b(2) + k_1 \xi_1^f(3)$
11. $\xi_2^f(3) = \xi_1^f(3) + k_1 \xi_1^b(2)$
12. $\xi_2^b(2) = \xi_1^b(1) + k_1 \xi_1^f(2)$
13. $k_2 = -\xi_2^f(3) / \xi_2^b(2)$
14. $\xi_3^f(4) = \xi_2^f(4) + k_2 \xi_2^b(3)$
15. $\xi_3^b(3) = \xi_2^b(2) + k_2 \xi_2^f(3)$
16. $k_3 = -\xi_3^f(4) / \xi_3^b(3)$

With the help of the corresponding tree decomposition diagram, this can be arranged as
shown in Figure 7.9. The obtained computational structure was named the superlattice
because it consists of a triangular array of latticelike stages (Carayannis et al. 1985). Note
that the superlattice has no redundancy and is characterized by local interconnections; that
is, the quantities needed at any given node are available from the immediate neighbors.

The two-dimensional layout of the superlattice suggests various algorithms to perform
the computations.

1. Parallel algorithm. We first note that all equations involving the coefficient $k_m$ constitute
one stage of the superlattice and can be computed in parallel after the computation of
$k_m$ because all inputs to the current stage are available from the previous one. This
algorithm can be implemented by 2(M − 1) processors in M − 1 "parallel" steps (Kung
and Hu 1983). Since each step involves one division to compute $k_m$ and then 2(M −
m) multiplications and additions for the parallel computations, the number of utilized
processors decreases from 2(M − 1) to 1. The algorithm is not order-recursive because
the order M must be known before the superlattice structure is set up.

2. Sequential algorithm. A sequential implementation of the parallel algorithm is essentially
equivalent to the version introduced for speech processing applications (LeRoux and
Gueguen 1977). This algorithm, which is implemented by the function k=schurlg(r,M)
and summarized in Table 7.4, starts with Equation (1) and computes sequentially
Equations (2), (3), etc.
FIGURE 7.8
Tree decomposition for the computations required by the algorithm of Schur.


3. Sequential order-recursive algorithm. The parallel algorithm starts at the left of the
superlattice and performs the computations within the vertical strips in parallel. Clearly,
the order M should be fixed before we start, and the algorithm is not order-recursive.
Careful inspection of the superlattice reveals that we can obtain an order-recursive
algorithm by organizing the computations in terms of the slanted shadowed strips shown
in Figure 7.9. Indeed, we start with $k_0$ and then perform the computations in the first
slanted strip to determine the quantities $\xi_1^f(2)$ and $\xi_1^b(1)$ needed to compute $k_1$. We
proceed with the next slanted strip, compute $k_2$, and conclude with the computation
of the last strip and $k_3$. The computations within each slanted strip are performed
sequentially.

4. Partitioned-parallel algorithm. Suppose that we have P processors with P < M. This
algorithm partitions the superlattice into groups of P consecutive slanted strips (partitions)
and performs the computations of each partition, in parallel, using the P processors. It
turns out that by storing some intermediate quantities, we have everything needed by
the superlattice to compute all the partitions, one at a time (Koukoutsis et al. 1991). This
algorithm provides a very convenient scheme for the implementation of the superlattice
using multiprocessing (see Problem 7.31).
FIGURE 7.9
Superlattice structure organization of the algorithm of Schur.
The input is the autocorrelation sequence and the output the
lattice parameters.


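TABLE 7.4
Summary of the algorithm of Schur.

1. Input: $\{r(l)\}_{l=0}^{M}$

2. Initialization
   (a) For l = 0, 1, ..., M
       $\xi_0^f(l) = \xi_0^b(l) = r(l)$
   (b) $k_0 = -\xi_0^f(1) / \xi_0^b(0)$
   (c) $P_1 = r(0)(1 - |k_0|^2)$

3. For m = 1, 2, ..., M − 1
   (a) For l = m, m + 1, ..., M
       $\xi_m^f(l) = \xi_{m-1}^f(l) + k_{m-1}^* \xi_{m-1}^b(l-1)$
       $\xi_m^b(l) = k_{m-1} \xi_{m-1}^f(l) + \xi_{m-1}^b(l-1)$
   (b) $k_m = -\xi_m^f(m+1) / \xi_m^b(m)$
   (c) $P_{m+1} = P_m (1 - |k_m|^2)$

4. Output: $\{k_m\}_{m=0}^{M-1}$, $\{P_m\}_{m=1}^{M}$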
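The steps of Table 7.4 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the book's `schurlg` routine: the function name `schur` is hypothetical, a real-valued autocorrelation sequence is assumed (so no conjugates appear), and the gapped functions are updated in place.

```python
from fractions import Fraction

def schur(r):
    """Schur algorithm for a real-valued autocorrelation sequence r(0..M).

    Returns (k, P) with k = [k_0..k_{M-1}] and P = [P_1..P_M].
    """
    M = len(r) - 1
    xf = list(r)                  # xf[l] holds xi_m^f(l) for the current order m
    xb = list(r)                  # xb[l] holds xi_m^b(l) for the current order m
    k, P = [], []
    for m in range(M):
        km = -xf[m + 1] / xb[m]   # k_m = -xi_m^f(m+1) / xi_m^b(m)
        k.append(km)
        # Order update (7.6.10) for lags l = M, ..., m+1. Going downward in l
        # guarantees that xb[l - 1] still holds the order-m value when used.
        for l in range(M, m, -1):
            xf[l], xb[l] = xf[l] + km * xb[l - 1], km * xf[l] + xb[l - 1]
        P.append(xb[m + 1])       # P_{m+1} = xi_{m+1}^b(m+1)
    return k, P

k, P = schur([Fraction(3), Fraction(2), Fraction(1), Fraction(1, 2)])
print(k, P)
```

Here the MMSE values are read off as $\xi_m^b(m)$ instead of being propagated through step 3(c); for exact arithmetic the two are equivalent. The input values are those of Example 7.6.1.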


Extended Schur algorithm. To extend the Schur algorithm for the computation of the
ladder parameters $k_m^c$, we define the cross-correlation sequence

$$\xi_m^c(l) \triangleq E\{x(n-l)\, e_m^*(n)\} \qquad \text{with } \xi_m^c(l) = 0, \text{ for } 0 \le l < m \tag{7.6.13}$$

due to the orthogonality principle. Multiplying (7.5.8) by $x^*(n-l)$ and taking the
mathematical expectation, we obtain a direct form

$$\xi_m^c(l) = d_{l+1} - \mathbf{c}_m^H \tilde{\mathbf{r}}_m(l) \tag{7.6.14}$$

and a ladder-form equation

$$\xi_{m+1}^c(l) = \xi_m^c(l) - k_m^{c*} \xi_m^b(l) \tag{7.6.15}$$

For l = m, we have

$$\xi_m^c(m) = d_{m+1} - \mathbf{c}_m^H \mathbf{J} \mathbf{r}_m = \beta_m^c \tag{7.6.16}$$

and

$$k_m^c = \frac{\beta_m^c}{P_m} = \frac{\xi_m^c(m)}{\xi_m^b(m)} \tag{7.6.17}$$

that is, we can compute the sequence $k_m^c$ using a lattice-ladder structure.

The computations can be arranged in the form of a superladder structure, shown in
Figure 7.10 (Koukoutsis et al. 1991). See also Problem 7.32. In turn, (7.6.17) can be used in
conjunction with the superlattice to determine the lattice-ladder parameters of the optimum
FIR filter. The superladder structure is illustrated in the following example.


FIGURE 7.10
Graphical illustration of the superladder structure.


EXAMPLE 7.6.2. Determine the lattice-ladder parameters of an optimum FIR filter with input
autocorrelation sequence given in Example 7.6.1 and cross-correlation sequence $d_1 = 1$, $d_2 = 2$,
and $d_3 = \tfrac{5}{2}$, using the extended Schur algorithm.

Solution. Since the lattice parameters were obtained in Example 7.6.1, we only need to find the
ladder parameters. Hence, using (7.6.15), (7.6.17), and the values of $\xi_m^b(l)$ computed in Example
7.6.1, we have

$$k_0^c = \frac{\xi_0^c(0)}{\xi_0^b(0)} = \frac{d_1}{r(0)} = \frac{1}{3}$$
$$\xi_1^c(1) = \xi_0^c(1) - k_0^c \xi_0^b(1) = 2 - \tfrac{1}{3}(2) = \tfrac{4}{3}$$
$$\xi_1^c(2) = \xi_0^c(2) - k_0^c \xi_0^b(2) = \tfrac{5}{2} - \tfrac{1}{3}(1) = \tfrac{13}{6}$$
$$k_1^c = \frac{\xi_1^c(1)}{\xi_1^b(1)} = \frac{4/3}{5/3} = \frac{4}{5}$$
$$\xi_2^c(2) = \xi_1^c(2) - k_1^c \xi_1^b(2) = \tfrac{13}{6} - \tfrac{4}{5}\left(\tfrac{4}{3}\right) = \tfrac{11}{10}$$
$$k_2^c = \frac{\xi_2^c(2)}{\xi_2^b(2)} = \frac{11/10}{8/5} = \frac{11}{16}$$

which provide the values of the ladder parameters. These values are identical to those obtained
in Example 7.4.3.
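The ladder recursion (7.6.15) and (7.6.17) can be run alongside the lattice recursion (7.6.10) in a single loop. The following Python sketch shows one way to do it for real-valued data; the function name `extended_schur` is hypothetical, and the cross-correlation values d = (1, 2, 5/2) used in the usage lines are an assumption of this sketch chosen for illustration.

```python
from fractions import Fraction

def extended_schur(r, d):
    """Ladder parameters k_m^c, m = 0..M-1, for real-valued r and d.

    r = [r(0), ..., r(M-1)] and d = [d_1, ..., d_M] (same length M).
    """
    M = len(d)
    xf = list(r[:M])              # xi_m^f(l)
    xb = list(r[:M])              # xi_m^b(l)
    xc = list(d)                  # xi_0^c(l) = d_{l+1}
    kc = []
    for m in range(M):
        kcm = xc[m] / xb[m]       # (7.6.17)
        kc.append(kcm)
        # Ladder update (7.6.15): uses xi_m^b(l) before the lattice update
        for l in range(m + 1, M):
            xc[l] = xc[l] - kcm * xb[l]
        # Lattice update (7.6.10) to order m + 1, lags processed downward
        if m + 1 < M:
            km = -xf[m + 1] / xb[m]
            for l in range(M - 1, m, -1):
                xf[l], xb[l] = xf[l] + km * xb[l - 1], km * xf[l] + xb[l - 1]
    return kc

kc = extended_schur([Fraction(3), Fraction(2), Fraction(1)],
                    [Fraction(1), Fraction(2), Fraction(5, 2)])
print(kc)
```

The ordering inside the loop matters: the ladder step must consume $\xi_m^b(l)$ before the lattice step overwrites it with $\xi_{m+1}^b(l)$.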


7.6.3 Inverse Schur Algorithm


The inverse Schur algorithm computes the autocorrelation sequence coefficients r(0),
r(1), ..., r(M) from the lattice parameters $k_0, k_1, \ldots, k_{M-1}$ and the MMSE $P_M$ of the linear
predictor. The organization of computations is best illustrated by the following example.


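EXAMPLE 7.6.3. Given the lattice filter coefficients

$$k_0 = -\tfrac{2}{3} \qquad k_1 = \tfrac{1}{5} \qquad k_2 = -\tfrac{1}{16}$$

and the MMSE $P_3 = 51/32$, compute the autocorrelation samples r(0), r(1), r(2), and r(3),
using the inverse Schur algorithm.

Solution. We base our approach on the part of the superlattice structure shown in Figure 7.9
that is enclosed by the nodes $\xi_0^b(0)$, $\xi_0^f(3)$, $\xi_2^f(3)$, and $\xi_2^b(2)$. To start at the lower left corner, we
compute r(0), using (7.4.30):

$$r(0) = \frac{P_3}{\prod_{m=0}^{2} (1 - |k_m|^2)} = \frac{51/32}{\left(1 - \tfrac{4}{9}\right)\left(1 - \tfrac{1}{25}\right)\left(1 - \tfrac{1}{256}\right)} = 3$$

This also follows from (7.5.31). Then, continuing the computations from the line defined by r(0)
and $\xi_2^b(2)$ to the node defined by $\xi_0^f(3) = r(3)$, we have

$$r(1) = -k_0 r(0) = -\left(-\tfrac{2}{3}\right) 3 = 2$$
$$\xi_1^b(1) = \xi_0^b(0) + k_0 \xi_0^f(1) = 3 + \left(-\tfrac{2}{3}\right) 2 = \tfrac{5}{3}$$
$$\xi_1^f(2) = -k_1 \xi_1^b(1) = -\tfrac{1}{5}\left(\tfrac{5}{3}\right) = -\tfrac{1}{3}$$
$$r(2) = \xi_0^f(2) = \xi_1^f(2) - k_0 \xi_0^b(1) = -\tfrac{1}{3} - \left(-\tfrac{2}{3}\right) 2 = 1$$
$$\xi_1^b(2) = \xi_0^b(1) + k_0 \xi_0^f(2) = 2 + \left(-\tfrac{2}{3}\right) 1 = \tfrac{4}{3}$$
$$\xi_2^b(2) = \xi_1^b(1) + k_1 \xi_1^f(2) = \tfrac{5}{3} + \tfrac{1}{5}\left(-\tfrac{1}{3}\right) = \tfrac{8}{5}$$
$$\xi_2^f(3) = -k_2 \xi_2^b(2) = -\left(-\tfrac{1}{16}\right)\tfrac{8}{5} = \tfrac{1}{10}$$
$$\xi_1^f(3) = \xi_2^f(3) - k_1 \xi_1^b(2) = \tfrac{1}{10} - \tfrac{1}{5}\left(\tfrac{4}{3}\right) = -\tfrac{1}{6}$$
$$r(3) = \xi_0^f(3) = \xi_1^f(3) - k_0 \xi_0^b(2) = -\tfrac{1}{6} - \left(-\tfrac{2}{3}\right) 1 = \tfrac{1}{2}$$

as can be easily verified by the reader. Thus, the autocorrelation sequence is

$$r(0) = 3 \qquad r(1) = 2 \qquad r(2) = 1 \qquad r(3) = \tfrac{1}{2}$$

which agrees with the autocorrelation sequence coefficients used in Example 7.6.1 with the direct
Schur algorithm.


The inverse Schur algorithm is implemented by the function r=invschur(k,PM), which
follows the same procedure as the previous example.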
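The procedure of Example 7.6.3 can be sketched for arbitrary order as shown below. The function `inverse_schur` is a hypothetical illustration, not the book's `invschur`; it assumes real-valued lattice parameters. For every new lag l it walks down the orders to obtain r(l) and then walks back up to obtain the backward gapped-function values needed at the next lag.

```python
from fractions import Fraction

def inverse_schur(k, PM):
    """Compute r(0..M) from real-valued k = [k_0..k_{M-1}] and the MMSE P_M."""
    M = len(k)
    P0 = PM
    for km in k:                          # r(0) from (7.4.30)
        P0 = P0 / (1 - km ** 2)
    r = [P0]
    b_prev = [P0]                         # b_prev[m] = xi_m^b(l-1), m = 0..l-1
    for l in range(1, M + 1):
        f = [0] * l                       # f[m] = xi_m^f(l), m = 0..l-1
        # xi_l^f(l) = 0 gives the starting value at order l-1
        f[l - 1] = -k[l - 1] * b_prev[l - 1]
        for m in range(l - 1, 0, -1):     # step the order down to 0
            f[m - 1] = f[m] - k[m - 1] * b_prev[m - 1]
        r.append(f[0])                    # r(l) = xi_0^f(l)
        b_cur = [f[0]]                    # xi_0^b(l) = r(l)
        for m in range(1, l + 1):         # step the order back up
            b_cur.append(k[m - 1] * f[m - 1] + b_prev[m - 1])
        b_prev = b_cur
    return r

r = inverse_schur([Fraction(-2, 3), Fraction(1, 5), Fraction(-1, 16)], Fraction(51, 32))
print(r)
```

With the inputs of Example 7.6.3 the intermediate values reproduce the nine computations listed there, and the returned sequence is 3, 2, 1, 1/2.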


7.7 TRIANGULARIZATION AND INVERSION OF TOEPLITZ MATRICES

In this section, we develop $LDL^H$ decompositions for both Toeplitz matrices and the inverse
of Toeplitz matrices, followed by a recursion for the computation of the inverse of a Toeplitz
matrix.


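7.7.1 $LDL^H$ Decomposition of Inverse of a Toeplitz Matrix


Since $\mathbf{R}_m$ is a Hermitian Toeplitz matrix that also happens to be persymmetric, that is,
$\mathbf{J}\mathbf{R}_m\mathbf{J} = \mathbf{R}_m^*$, taking its inverse, we obtain

$$\mathbf{J}\mathbf{R}_m^{-1}\mathbf{J} = (\mathbf{R}_m^*)^{-1} \tag{7.7.1}$$

The last equation shows that the inverse of a Toeplitz matrix, although not Toeplitz, is
persymmetric. From (7.1.58), we recall that the BLP coefficients and the MMSE $P_m^b$ provide
the quantities for the $UDU^H$ decomposition of $\mathbf{R}_{m+1}^{-1}$, that is,

$$\mathbf{R}_{m+1}^{-1} = \mathbf{B}_{m+1}^H \mathbf{D}_{m+1}^{-1} \mathbf{B}_{m+1} \tag{7.7.2}$$

where

$$\mathbf{B}_{m+1} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ b_0^{(1)*} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ b_0^{(m)*} & b_1^{(m)*} & \cdots & 1 \end{bmatrix} \tag{7.7.3}$$

and

$$\mathbf{D}_{m+1} = \mathrm{diag}\{P_0^b, P_1^b, \ldots, P_m^b\} \tag{7.7.4}$$

For a Toeplitz matrix $\mathbf{R}_{m+1}$, we can obtain the $LDL^H$ decomposition of its inverse by using
(7.7.2) and the property $\mathbf{J} = \mathbf{J}^{-1}$ of the exchange matrix. Starting with (7.7.1), we obtain

$$(\mathbf{R}_{m+1}^*)^{-1} = \mathbf{J}\mathbf{R}_{m+1}^{-1}\mathbf{J} = (\mathbf{J}\mathbf{B}_{m+1}^H\mathbf{J})(\mathbf{J}\mathbf{D}_{m+1}^{-1}\mathbf{J})(\mathbf{J}\mathbf{B}_{m+1}\mathbf{J}) \tag{7.7.5}$$

If we define

$$\mathbf{A}_{m+1} \triangleq \mathbf{J}\mathbf{B}_{m+1}^*\mathbf{J} \tag{7.7.6}$$

and

$$\bar{\mathbf{D}}_{m+1} \triangleq \mathbf{J}\mathbf{D}_{m+1}\mathbf{J} = \mathrm{diag}\{P_m, P_{m-1}, \ldots, P_0\} \tag{7.7.7}$$

then (7.7.5) gives

$$\mathbf{R}_{m+1}^{-1} = \mathbf{A}_{m+1}^H \bar{\mathbf{D}}_{m+1}^{-1} \mathbf{A}_{m+1} \tag{7.7.8}$$

which provides the unique $LDL^H$ decomposition of the matrix $\mathbf{R}_{m+1}^{-1}$. Indeed, using the
property $\mathbf{a}_j = \mathbf{J}\mathbf{b}_j^*$ for $1 \le j \le m$, or equivalently $a_i^{(j)} = b_{j-i}^{(j)*}$, we can write matrix
$\mathbf{A}_{m+1} = \mathbf{J}\mathbf{B}_{m+1}^*\mathbf{J}$ as

$$\mathbf{A}_{m+1} = \begin{bmatrix} 1 & a_1^{(m)*} & a_2^{(m)*} & \cdots & a_m^{(m)*} \\ 0 & 1 & a_1^{(m-1)*} & \cdots & a_{m-1}^{(m-1)*} \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & a_1^{(1)*} \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix} \tag{7.7.9}$$

which is an upper unit triangular matrix. We stress that the property $\mathbf{J}\mathbf{B}_{m+1}^*\mathbf{J} = \mathbf{A}_{m+1}$ and
the above derivation of (7.7.8) hold for Toeplitz matrices only. However, the decomposition
in (7.7.2) holds for any Hermitian, positive definite matrix (see Section 7.1.4).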

As we saw in Section 6.3, the solution of the normal equations $\mathbf{R}\mathbf{c} = \mathbf{d}$ can be obtained
in three steps as

$$\mathbf{R} = \mathbf{L}\mathbf{D}\mathbf{L}^H \;\Rightarrow\; \mathbf{L}\mathbf{D}\mathbf{k}^c = \mathbf{d} \;\Rightarrow\; \mathbf{L}^H\mathbf{c} = \mathbf{k}^c \tag{7.7.10}$$

where the $LDL^H$ decomposition requires about $M^3/6$ flops and the solution of each triangular
system $M^2/2$ flops. Since $\mathbf{R}^{-1} = \mathbf{B}^H\mathbf{D}^{-1}\mathbf{B}$, the Levinson-Durbin algorithm performs
the $UDU^H$ decomposition of $\mathbf{R}^{-1}$ when R is Toeplitz, at a cost of $M^2$ flops; that is, it
reduces the computational complexity by an order of magnitude. The Levinson recursion
for the optimum filter is equivalent to the solution of the two triangular systems and requires
$M^2$ operations.


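EXAMPLE 7.7.1. Compute the lattice-ladder parameters of an MMSE finite impulse response
filter specified by the normal equations

$$\begin{bmatrix} 3 & 2 & 1 \\ 2 & 3 & 2 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} h(0) \\ h(1) \\ h(2) \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ \tfrac{5}{2} \end{bmatrix}$$

using two different approaches: the $LDL^H$ decomposition and the algorithm of Levinson.


Solution. The $LDL^H$ decomposition of R is

$$\mathbf{L} = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{2}{3} & 1 & 0 \\ \tfrac{1}{3} & \tfrac{4}{5} & 1 \end{bmatrix} \qquad \mathbf{D} = \begin{bmatrix} 3 & 0 & 0 \\ 0 & \tfrac{5}{3} & 0 \\ 0 & 0 & \tfrac{8}{5} \end{bmatrix} \qquad \mathbf{L}^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -\tfrac{2}{3} & 1 & 0 \\ \tfrac{1}{5} & -\tfrac{4}{5} & 1 \end{bmatrix}$$

and using (7.3.31), we have

$$\mathbf{k}^c = \mathbf{D}^{-1}\mathbf{L}^{-1}\mathbf{d} = \left[ \tfrac{1}{3} \;\; \tfrac{4}{5} \;\; \tfrac{11}{16} \right]^T$$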

which gives the three ladder parameters. The two lattice parameters are obtained by solving the 


system 

LP Iplitks =r5 with = [1 27 
which gives $k_0 = -\tfrac{2}{3}$ and $k_1 = \tfrac{1}{5}$. The results agree with those obtained in Example 7.4.3 using
the algorithm of Levinson. We also note that the rows of $\mathbf{L}^{-1}$ provide the first- and second-order
forward and backward linear predictors. This is the case because the matrix is Toeplitz. For
symmetric matrices the $LDL^H$ decomposition provides the backward predictors only.


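7.7.2 $LDL^H$ Decomposition of a Toeplitz Matrix


The computation of the $LDL^H$ decomposition of a symmetric, positive definite matrix
requires on the order of $M^3$ computations. In Section 7.1, we saw that the cross-correlation
between x(n) and $e_m^b(n)$ is related to the $LDL^H$ decomposition of the correlation matrix $\mathbf{R}_m$.
We next show that we can extend the Schur algorithm to compute the $LDL^H$ decomposition
of a Toeplitz matrix with $O(M^2)$ computations using the cross-correlations $\xi_m^b(l)$.

To illustrate the basic process, we note that evaluating the product on the left with the
help of (7.6.4), we obtain

$$\begin{bmatrix} r(0) & r^*(1) & r^*(2) & r^*(3) \\ r(1) & r(0) & r^*(1) & r^*(2) \\ r(2) & r(1) & r(0) & r^*(1) \\ r(3) & r(2) & r(1) & r(0) \end{bmatrix} \begin{bmatrix} 1 & b_0^{(1)*} & b_0^{(2)*} & b_0^{(3)*} \\ 0 & 1 & b_1^{(2)*} & b_1^{(3)*} \\ 0 & 0 & 1 & b_2^{(3)*} \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \xi_0^b(0) & 0 & 0 & 0 \\ \xi_0^b(1) & \xi_1^b(1) & 0 & 0 \\ \xi_0^b(2) & \xi_1^b(2) & \xi_2^b(2) & 0 \\ \xi_0^b(3) & \xi_1^b(3) & \xi_2^b(3) & \xi_3^b(3) \end{bmatrix}$$

that is, a lower triangular matrix $\tilde{\mathbf{L}}$, which can be written as

$$\tilde{\mathbf{L}} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ \frac{\xi_0^b(1)}{P_0} & 1 & 0 & 0 \\ \frac{\xi_0^b(2)}{P_0} & \frac{\xi_1^b(2)}{P_1} & 1 & 0 \\ \frac{\xi_0^b(3)}{P_0} & \frac{\xi_1^b(3)}{P_1} & \frac{\xi_2^b(3)}{P_2} & 1 \end{bmatrix} \begin{bmatrix} P_0 & 0 & 0 & 0 \\ 0 & P_1 & 0 & 0 \\ 0 & 0 & P_2 & 0 \\ 0 & 0 & 0 & P_3 \end{bmatrix} = \mathbf{L}\mathbf{D}$$

because $P_m = \xi_m^b(m) > 0$. Therefore, $\mathbf{R}\mathbf{B}^H = \mathbf{L}\mathbf{D}$ and since R is Hermitian, we have
$\mathbf{R} = \mathbf{L}\mathbf{D}\mathbf{B}^{-H} = \mathbf{B}^{-1}\mathbf{D}\mathbf{L}^H$, which implies that $\mathbf{B}^{-1} = \mathbf{L}$. This results in the following
$LDL^H$ factorization of the (M + 1) × (M + 1) symmetric Toeplitz matrix R

$$\mathbf{R} = \mathbf{L}\mathbf{D}\mathbf{L}^H \tag{7.7.11}$$

where

$$\mathbf{L} = \mathbf{B}^{-1} = \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ \bar{\xi}_0^b(1) & 1 & \cdots & 0 & 0 \\ \bar{\xi}_0^b(2) & \bar{\xi}_1^b(2) & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \bar{\xi}_0^b(M) & \bar{\xi}_1^b(M) & \cdots & \bar{\xi}_{M-1}^b(M) & 1 \end{bmatrix} \tag{7.7.12}$$

$$\bar{\xi}_m^b(l) = \frac{\xi_m^b(l)}{\xi_m^b(m)} = \frac{\xi_m^b(l)}{P_m} \tag{7.7.13}$$

and

$$\mathbf{D} = \mathrm{diag}\{P_0, P_1, \ldots, P_M\} \tag{7.7.14}$$

The basic recursion (7.6.10) in the algorithm of Schur can be extended to compute the
elements of L and hence the $LDL^H$ factorization of the Toeplitz matrix R (see Problem 7.33).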
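A minimal Python sketch of this idea for a real symmetric Toeplitz matrix is given below. The function name `toeplitz_ldl` is hypothetical; the sketch reuses the Schur order update (7.6.10) and fills the columns of L from the normalized values in (7.7.13).

```python
from fractions import Fraction

def toeplitz_ldl(r):
    """L and D of R = L D L^T for the real symmetric Toeplitz matrix whose
    first column is r = [r(0), ..., r(M)], following (7.7.12)-(7.7.14)."""
    M = len(r) - 1
    xf = list(r)                          # xi_m^f(l)
    xb = list(r)                          # xi_m^b(l)
    L = [[0] * (M + 1) for _ in range(M + 1)]
    D = []
    for m in range(M + 1):
        Pm = xb[m]                        # P_m = xi_m^b(m)
        D.append(Pm)
        for l in range(m, M + 1):
            L[l][m] = xb[l] / Pm          # column m of L, (7.7.13)
        if m < M:
            km = -xf[m + 1] / Pm
            for l in range(M, m, -1):     # Schur order update (7.6.10)
                xf[l], xb[l] = xf[l] + km * xb[l - 1], km * xf[l] + xb[l - 1]
    return L, D

L, D = toeplitz_ldl([Fraction(3), Fraction(2), Fraction(1)])
print(L, D)
```

Each column of L costs O(M) operations, so the whole factorization costs O(M²) instead of the O(M³) of a general-purpose decomposition.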
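Since a Toeplitz matrix is persymmetric, that is, $\mathbf{J}\mathbf{R}\mathbf{J} = \mathbf{R}^*$, we have

$$\mathbf{R} = \mathbf{J}\mathbf{R}^*\mathbf{J} = (\mathbf{J}\mathbf{L}^*\mathbf{J})(\mathbf{J}\mathbf{D}\mathbf{J})(\mathbf{J}\mathbf{L}^T\mathbf{J}) \triangleq \mathbf{U}\bar{\mathbf{D}}\mathbf{U}^H \tag{7.7.15}$$

which provides the $UDU^H$ decomposition of R. Notice that the relation $\mathbf{U} = \mathbf{J}\mathbf{L}^*\mathbf{J}$ also can
be obtained from $\mathbf{A} = \mathbf{J}\mathbf{B}^*\mathbf{J}$ [see (7.4.11)], which in turn is a consequence of the symmetry
between forward and backward prediction for stationary processes.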

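The validity of (7.6.10) also can be shown by computing the product

$$
\mathbf{R}\mathbf{A}^H =
\begin{bmatrix}
r(0) & r(1) & r(2) & r(3) \\
r(1) & r(0) & r(1) & r(2) \\
r(2) & r(1) & r(0) & r(1) \\
r(3) & r(2) & r(1) & r(0)
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
a_1^{(3)*} & 1 & 0 & 0 \\
a_2^{(3)*} & a_1^{(2)*} & 1 & 0 \\
a_3^{(3)*} & a_2^{(2)*} & a_1^{(1)*} & 1
\end{bmatrix}
=
\begin{bmatrix}
\xi_3^f(0) & \xi_2^f(-1) & \xi_1^f(-2) & \xi_0^f(-3) \\
0 & \xi_2^f(0) & \xi_1^f(-1) & \xi_0^f(-2) \\
0 & 0 & \xi_1^f(0) & \xi_0^f(-1) \\
0 & 0 & 0 & \xi_0^f(0)
\end{bmatrix} \tag{7.7.16}
$$

with the help of (7.6.3) and $r(-l) = r^*(l)$. The formula $\mathbf{U} = \mathbf{J}\mathbf{L}^*\mathbf{J}$ relates $\xi_m^f(l)$ and $\xi_m^b(l)$,
as expected by (7.6.10).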
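The factorization can be checked numerically. The following Python sketch is a verification aid, not the $O(M^2)$ Schür extension itself: it assumes real data, uses the Levinson-Durbin recursion to obtain the backward predictors (with $\mathbf{b}_m = \mathbf{J}\mathbf{a}_m$), forms $\mathbf{R}\mathbf{B}^T$ by direct multiplication, and confirms that the result is lower triangular with diagonal $P_m$. The autocorrelation values and all function and variable names are illustrative assumptions.

```python
def levinson_durbin(r):
    """Forward predictors a[m] (orders 0..M) and error powers P[m] for real r(0..M)."""
    a, P = [[]], [r[0]]
    for m in range(1, len(r)):
        prev = a[-1]
        acc = r[m] + sum(prev[k] * r[m - 1 - k] for k in range(m - 1))
        km = -acc / P[-1]                      # reflection coefficient of order m
        a.append([prev[k] + km * prev[m - 2 - k] for k in range(m - 1)] + [km])
        P.append(P[-1] * (1.0 - km * km))
    return a, P

r = [1.0, 0.8, 0.6, 0.4]                       # illustrative autocorrelation values
M = len(r) - 1
a, P = levinson_durbin(r)
R = [[r[abs(i - j)] for j in range(M + 1)] for i in range(M + 1)]

# Row m of B holds the m-th order backward predictor (J a_m), then a 1, then zeros.
B = [a[m][::-1] + [1.0] + [0.0] * (M - m) for m in range(M + 1)]

# xi[l][m] = (R B^T)[l][m], i.e., the cross-correlation xi_m^b(l).
xi = [[sum(R[l][k] * B[m][k] for k in range(M + 1)) for m in range(M + 1)]
      for l in range(M + 1)]

# Dividing column m by P_m = xi_m^b(m) gives the unit lower triangular factor L.
L = [[xi[l][m] / P[m] for m in range(M + 1)] for l in range(M + 1)]

# Rebuild R from L D L^T as a consistency check.
R_check = [[sum(L[i][m] * P[m] * L[j][m] for m in range(M + 1))
            for j in range(M + 1)] for i in range(M + 1)]
```

The matrix product used here costs $O(M^3)$ operations; it serves only to confirm the structure that the Schür recursion computes directly.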


7.7.3 Inversion of Real Toeplitz Matrices 


From the discussion in Section 7.1, it follows from (7.1.12) that the inverse $\mathbf{Q}_M$ of a
symmetric, positive definite matrix $\mathbf{R}_M$ is given by

$$
\mathbf{Q}_M = \begin{bmatrix} \mathbf{Q} & \mathbf{q} \\ \mathbf{q}^T & q \end{bmatrix} \tag{7.7.18}
$$

with

$$
\mathbf{q} = \frac{\mathbf{b}}{P} \tag{7.7.19}
$$

$$
q = \frac{1}{P} \tag{7.7.20}
$$

and

$$
\mathbf{Q} = \mathbf{R}^{-1} + \frac{1}{P}\mathbf{b}\mathbf{b}^T \tag{7.7.21}
$$

as given by (7.1.18), (7.1.19), and (7.1.21). The matrix $\mathbf{Q}$ is an $(M - 1) \times (M - 1)$ matrix,
and $\mathbf{b}$ is the $(M - 1)$st-order BLP. Next we show that for Toeplitz matrices we can compute
$\mathbf{Q}_M$ with $O(M^2)$ computations.

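First, we note that the last column and the last row of $\mathbf{Q}_M$ can be obtained by solving
the Toeplitz system $\mathbf{R}\mathbf{b} = -\mathbf{J}\mathbf{r}$ using the Levinson-Durbin algorithm. Then we show that
we can compute the elements of $\mathbf{Q}$ by exploiting the persymmetry property of Toeplitz
matrices, moving from the known edges to the interior. Indeed, since $\mathbf{R}$ is persymmetric,
that is, $\mathbf{R} = \mathbf{J}\mathbf{R}\mathbf{J}$, we have $\mathbf{R}^{-1} = \mathbf{J}\mathbf{R}^{-1}\mathbf{J}$, that is, $\mathbf{R}^{-1}$ is also persymmetric. From (7.7.21),
we have

$$
(\mathbf{Q})_{ij} = (\mathbf{R}^{-1})_{ij} + P q_i q_j = (\mathbf{R}^{-1})_{M-j,M-i} + P q_i q_j \tag{7.7.22}
$$

because $\mathbf{R}^{-1}$ is persymmetric, and

$$
(\mathbf{R}^{-1})_{M-j,M-i} = (\mathbf{Q})_{M-j,M-i} - P q_{M-j} q_{M-i} \tag{7.7.23}
$$

Combining (7.7.22) and (7.7.23), we obtain

$$
(\mathbf{Q})_{ij} = (\mathbf{Q})_{M-j,M-i} + P(q_i q_j - q_{M-j} q_{M-i}) \tag{7.7.24}
$$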


which in conjunction with persymmetry makes possible the computation of the elements
of $\mathbf{Q}$ from $\mathbf{q}$ and $q$. The process is illustrated for $M = 6$ in the following diagram

$$
\mathbf{Q}_6 =
\begin{bmatrix}
p_1 & p_1 & p_1 & p_1 & p_1 & k \\
p_1 & p_2 & p_2 & p_2 & u_1 & k \\
p_1 & p_2 & p_3 & u_2 & u_1 & k \\
p_1 & p_2 & u_2 & u_2 & u_1 & k \\
p_1 & u_1 & u_1 & u_1 & u_1 & k \\
k & k & k & k & k & k
\end{bmatrix}
$$

where we start with the known elements $k$ and then compute the $u$ elements by using the
updating property (7.7.22) and the elements $p$ by using the persymmetry property (7.7.24)
in the following order: $k \to p_1 \to u_1 \to p_2 \to u_2 \to p_3$. Clearly, because the matrix
$\mathbf{Q}_M = \mathbf{R}_M^{-1}$ is both symmetric and persymmetric, we need to compute only the elements
in the following wedge:

$$
\begin{matrix}
p_1 & p_1 & p_1 & p_1 & p_1 & k \\
 & p_2 & p_2 & p_2 & u_1 & \\
 & & p_3 & u_2 & &
\end{matrix}
$$

which can be easily extended to the general case. This algorithm, which was introduced by
Trench (1964), requires $O(M^2)$ operations and is implemented by the function
Q=invtoepl(r,M)


The algorithm is generalized for complex Toeplitz matrices in Problem 7.40. 
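The edge-to-interior procedure can be sketched in a few lines of Python. This is a minimal illustration for real symmetric Toeplitz matrices, not the listing of the function named above: `inv_toeplitz` and `levinson_durbin` are hypothetical names, indices are 0-based, and for brevity the sketch fills the whole matrix instead of only the wedge. It applies (7.7.24) together with the persymmetry of $\mathbf{Q}_M$, which in 0-based form gives $(\mathbf{Q}_M)_{i,j} = (\mathbf{Q}_M)_{i+1,j+1} + P(q_i q_j - q_{M-2-j}\,q_{M-2-i})$.

```python
def levinson_durbin(r, order):
    """Forward predictor a (length `order`) and error power P for real r."""
    a, P = [], r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - 1 - k] for k in range(m - 1))
        km = -acc / P
        a = [a[k] + km * a[m - 2 - k] for k in range(m - 1)] + [km]
        P *= 1.0 - km * km
    return a, P

def inv_toeplitz(r, M):
    """Inverse of the M x M real symmetric Toeplitz matrix with first row r[0..M-1]."""
    a, P = levinson_durbin(r, M - 1)
    q = [c / P for c in reversed(a)]           # q = b / P, with b = J a for real data
    Q = [[0.0] * M for _ in range(M)]
    Q[M - 1][M - 1] = 1.0 / P                  # lower right corner, (7.7.20)
    for i in range(M - 1):                     # known last column and last row
        Q[i][M - 1] = Q[M - 1][i] = q[i]
    for i in range(M - 2, -1, -1):             # move from the known edge inward
        for j in range(M - 2, -1, -1):
            Q[i][j] = Q[i + 1][j + 1] + P * (q[i] * q[j] - q[M - 2 - j] * q[M - 2 - i])
    return Q

r = [1.0, 0.8, 0.6, 0.4]                       # illustrative autocorrelation values
Q = inv_toeplitz(r, 4)
```

Each interior element costs a constant number of operations, so the total work after the Levinson-Durbin step is proportional to $M^2$.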


7.8 KALMAN FILTER ALGORITHM 


The various optimum linear filter algorithms and structures that we discussed so far in this
chapter provide us with the determination of filter coefficients or optimal estimates using
some form of recursive update. Some algorithms and structures are order-recursive while
others are time-recursive. In effect, they tell us how the past values should be updated to
determine the present values. Unfortunately, these techniques do not lend themselves very
well to the more complicated nonstationary problems. Note that the
only case in which we obtained efficient order-recursive algorithms and structures was in
the stationary environment, using the approaches of Levinson and Schür.

In 1960, R. E. Kalman provided an alternative approach to formulating the MMSE 
linear filtering problem using dynamic models. This “Kalman filter” technique was quickly 
hailed as a practical solution to a number of problems that were intractable using the more 
established Wiener methods. As we see in this section, the Kalman filter algorithm is actually 
a special case of the optimal linear filter algorithms that we have studied. However, it is 
used in a number of fields such as aerospace and navigation, where a signal trajectory can 
be well defined. Its use in statistical signal processing is somewhat limited (adaptive filters 
discussed in Chapter 10 are more appropriate). The two main features of the Kalman filter 
formulation and solution are the dynamic (or state-space) modeling of the random processes 
under consideration and the time-recursive processing of the input data. 

In this section, we discuss only the discrete-time Kalman filter. The continuous-time 
version is covered in several texts including Gelb (1977) and Brown and Hwang (1997). 
As a motivation to this approach, we begin with the following estimation problem. 


7.8.1 Preliminary Development 


Suppose that we want to obtain a linear MMSE estimate of a random variable $y$ using the
related random variables (observations) $\{x_1, x_2, \ldots, x_m\}$, that is,

$$
\hat{y}_m \triangleq E\{y \mid x_1, x_2, \ldots, x_m\} \tag{7.8.1}
$$

as described in Section 7.1.5. Furthermore, we want to obtain this estimate in an
order-recursive fashion, that is, determine $\hat{y}_m$ in terms of $\hat{y}_{m-1}$. We considered and solved this
problem in Section 7.1. Our approach, which is somewhat different from that in Section 7.1,
is as follows: Assume that we have computed the corresponding estimate $\hat{y}_{m-1}$, we have the
observations $\{x_1, x_2, \ldots, x_m\}$, and we wish to determine the estimate $\hat{y}_m$. Then we carry
out the following steps:


1. We first determine the optimal one-step prediction of $x_m$, that is,

$$
\hat{x}_{m|m-1} \triangleq E\{x_m \mid x_1, x_2, \ldots, x_{m-1}\}
= [\mathbf{R}_{m-1}^{-1}\mathbf{r}^b_{m-1}]^H \mathbf{x}_{m-1}
= -\mathbf{b}^H_{m-1}\mathbf{x}_{m-1}
= -\sum_{k=1}^{m-1} [b_k^{(m-1)}]^* x_k \tag{7.8.2}
$$

where the vector and matrix quantities are as defined in Section 7.1.
2. When the new data value $x_m$ is received, we determine the optimal prediction error

$$
e^b_m = x_m - \hat{x}_{m|m-1} \triangleq w_m \tag{7.8.3}
$$

which is the new information or innovations contained in the new data.
3. Determine a linear MMSE estimate of $y$, given the new information $w_m$:

$$
E\{y \mid w_m\} = E\{y w_m^*\}(E\{w_m w_m^*\})^{-1} w_m \tag{7.8.4}
$$

4. Finally, form a linear estimate $\hat{y}_m$ of the form

$$
\hat{y}_m = \hat{y}_{m-1} + E\{y \mid w_m\} = \hat{y}_{m-1} + E\{y w_m^*\}(E\{w_m w_m^*\})^{-1} w_m \tag{7.8.5}
$$


The algorithm is initialized with $\hat{y}_0 = 0$. Note that the quantity $E\{y w_m^*\}(E\{w_m w_m^*\})^{-1}$
is equal to the coefficient $k_m^{c*}$, and that we have rederived (7.1.51). For the implementation
of (7.8.5), see Figure 7.1.

EXAMPLE 7.8.1. Let the observed random data be obtained from a stationary random process;
that is, the data are of the form

$$
\{x(1), x(2), \ldots, x(n), \ldots\} \qquad r(n, l) = r(n - l)
$$

Also instead of estimating a single random variable, we want to estimate the sample $y(n)$ of a
random process $\{y(n)\}$ that is jointly stationary with $x(n)$. Then, following the analysis leading
to (7.8.5), we obtain

$$
\hat{y}(n) = \hat{y}(n-1) + k_n^{c*} w(n) = \hat{y}(n-1) + k_n^{c*}\left\{x(n) + \sum_{k=1}^{n-1} [b_k^{(n-1)}]^* x(k)\right\} \tag{7.8.6}
$$

It is interesting to note that, because of stationarity, we have a time-recursive algorithm in (7.8.6).
The coefficients $\{k_n^{c*}\}$ can be obtained recursively by using the algorithms of Levinson or Schür.
However, the data prediction term does require a growing memory. Indeed, if we define the
vector

$$
\mathbf{x}(n) = [x(1) \; x(2) \; \cdots \; x(n)]^T
$$

whose order is equal to time index $n$, we have

$$
\hat{y}(n) = \sum_{k=1}^{n} [c_k^{(n)}]^* x(k) \triangleq \mathbf{c}_n^H \mathbf{x}(n)
$$

The optimum estimator is given by

$$
\mathbf{R}_n \mathbf{c}_n = \mathbf{d}_n
$$

where

$$
\mathbf{R}_n \triangleq E\{\mathbf{x}(n)\mathbf{x}^H(n)\} \qquad \mathbf{d}_n \triangleq E\{\mathbf{x}(n) y^*(n)\}
$$

Since, owing to stationarity, the matrix $\mathbf{R}_n$ is Toeplitz, we can derive a lattice-ladder structure
$\{k_n, k_n^c\}$ that solves this problem recursively (see Section 7.4). When each new observation
$x(n + 1)$ is received, we use the moments $r(n + 1)$ and $d(n + 1)$ to compute new lattice-ladder
parameters $\{k_{n+1}, k_{n+1}^c\}$, and we add a new stage to the "growing-order" (and, therefore,
growing-memory) filter.


The above example underscores two problems with our estimation technique if we were
to obtain a true time-recursive algorithm with finite memory. The first problem concerns the
time-recursive update for the $k_n^c$ term or, in particular, for $E\{y w_n^*\}$ and $(E\{w_n w_n^*\})^{-1}$. We
alluded to this problem in Section 7.1. In the example, we solved this problem by assuming
a stationary signal environment. The second problem deals with the infinite memory in
(7.8.2). This problem can be solved if we are able to compute the data prediction term also
in a time-recursive fashion. In the stationary case, this problem can be solved by using the
Levinson-Durbin or Schür algorithm. For nonstationary situations, the above two problems
are solved by the Kalman filter by assuming appropriate dynamic models for the process to
be estimated and for the observation data.

Consider the optimal one-step prediction term in (7.8.2), defined as

$$
\hat{x}(n|n-1) \triangleq E\{x(n) \mid x(0), \ldots, x(n-1)\} \tag{7.8.7}
$$

which requires growing memory. If we assume the following linear data relation model

$$
x(n) = H(n)y(n) + v(n) \tag{7.8.8}
$$

with

$$
E\{v(n)y^*(l)\} = 0 \quad \text{for all } n, l \tag{7.8.9}
$$

$$
E\{v(n)v^*(l)\} = r_v(n)\delta_{n,l} \quad \text{for all } n, l \tag{7.8.10}
$$

then (7.8.7) becomes

$$
\hat{x}(n|n-1) = E\{[H(n)y(n) + v(n)] \mid x(0), \ldots, x(n-1)\} = H(n)\hat{y}(n|n-1) \tag{7.8.11}
$$

where we have used the notation

$$
\hat{y}(n|n-1) \triangleq E\{y(n) \mid x(0), \ldots, x(n-1)\} \tag{7.8.12}
$$


Thus, we will be successful in obtaining a finite-memory computation for $\hat{x}(n|n-1)$ if we
can obtain a recursion for $\hat{y}(n|n-1)$ in terms of $\hat{y}(n-1|n-1)$. This is possible if we
assume the following linear signal model

$$
y(n) = a(n-1)y(n-1) + \eta(n) \tag{7.8.13}
$$

with appropriate statistical assumptions on the random process $\eta(n)$. Thus it is now possible
to complete the development of the Kalman filter. The signal model (7.8.13) provides the
dynamics of the time evolution of the signal to be estimated while (7.8.8) is known as the
observation model, since it relates the signal $y(n)$ with the observation $x(n)$. These models
are formally defined in the next section.


7.8.2 Development of Kalman Filter 


Since the Kalman filter is also well suited for vector processes, we begin by assuming that
the random process to be estimated can be modeled in the form

$$
\mathbf{y}(n) = \mathbf{A}(n-1)\mathbf{y}(n-1) + \mathbf{B}(n)\boldsymbol{\eta}(n) \tag{7.8.14}
$$

which is known as the signal (or state vector) model where

$\mathbf{y}(n)$ = $k \times 1$ signal state vector at time $n$
$\mathbf{A}(n-1)$ = $k \times k$ matrix that relates $\mathbf{y}(n-1)$ to $\mathbf{y}(n)$ in absence of a forcing function
$\boldsymbol{\eta}(n)$ = $k \times 1$ zero-mean white noise sequence with covariance matrix $\mathbf{R}_\eta(n)$
$\mathbf{B}(n)$ = $k \times k$ input matrix
(7.8.15)

The matrix $\mathbf{A}(n-1)$ is known as the state-transition matrix while $\boldsymbol{\eta}(n)$ is also known as
the modeling error vector.
The observation (or measurement) model is described using the linear relationship

$$
\mathbf{x}(n) = \mathbf{H}(n)\mathbf{y}(n) + \mathbf{v}(n) \tag{7.8.16}
$$

where

$\mathbf{x}(n)$ = $m \times 1$ observation vector at time $n$
$\mathbf{H}(n)$ = $m \times k$ matrix that gives ideal linear relationship between $\mathbf{y}(n)$ and $\mathbf{x}(n)$
$\mathbf{v}(n)$ = $m \times 1$ zero-mean white noise sequence with covariance matrix $\mathbf{R}_v(n)$
(7.8.17)

The matrix $\mathbf{H}(n)$ is known as the output matrix, and the sequence $\mathbf{v}(n)$ is known as the
observation error.
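We further assume the following statistical properties:

$$
E\{\boldsymbol{\eta}(n)\mathbf{v}^H(l)\} = \mathbf{0} \quad \text{for all } n, l \tag{7.8.18}
$$

$$
E\{\mathbf{y}(n)\mathbf{v}^H(l)\} = \mathbf{0} \quad \text{for all } n, l \tag{7.8.19}
$$

$$
E\{\boldsymbol{\eta}(n)\mathbf{y}^H(-1)\} = \mathbf{0} \quad \text{for all } n \tag{7.8.20}
$$

$$
E\{\mathbf{y}(-1)\} = \mathbf{0} \tag{7.8.21}
$$

$$
E\{\mathbf{y}(-1)\mathbf{y}^H(-1)\} = \mathbf{R}_y(-1) \tag{7.8.22}
$$

The first three relations, (7.8.18) to (7.8.20), imply orthogonality between respective random
variables while the last two, (7.8.21) and (7.8.22), establish the mean and covariance of the
initial-condition vector $\mathbf{y}(-1)$.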

From (7.8.14) and (7.8.21) the mean of $\mathbf{y}(n)$ is zero for all $n$, and the evolution of its
correlation matrix is given by

$$
\mathbf{R}_y(n) = \mathbf{A}(n-1)\mathbf{R}_y(n-1)\mathbf{A}^H(n-1) + \mathbf{B}(n)\mathbf{R}_\eta(n)\mathbf{B}^H(n) \tag{7.8.23}
$$

From (7.8.16), the mean of $\mathbf{x}(n)$ is zero for all $n$, and from (7.8.23) the evolution of its
correlation matrix is given by

$$
\mathbf{R}_x(n) = \mathbf{H}(n)[\mathbf{A}(n-1)\mathbf{R}_y(n-1)\mathbf{A}^H(n-1) + \mathbf{B}(n)\mathbf{R}_\eta(n)\mathbf{B}^H(n)]\mathbf{H}^H(n) + \mathbf{R}_v(n) \tag{7.8.24}
$$


Evolution of optimal estimates 


We now assume that we have available the MMSE estimate $\hat{\mathbf{y}}(n-1|n-1)$ of $\mathbf{y}(n-1)$
based on the observations up to and including time $n-1$. Using (7.8.14) and (7.8.20), the
one-step prediction of $\mathbf{y}(n)$ is given by

$$
\hat{\mathbf{y}}(n|n-1) = \mathbf{A}(n-1)\hat{\mathbf{y}}(n-1|n-1) \tag{7.8.25}
$$

with initial condition $\hat{\mathbf{y}}(-1|-1) = \mathbf{y}(-1)$. From (7.8.16), the one-step prediction of $\mathbf{x}(n)$
is given by

$$
\hat{\mathbf{x}}(n|n-1) = \mathbf{H}(n)\hat{\mathbf{y}}(n|n-1) = \mathbf{H}(n)\mathbf{A}(n-1)\hat{\mathbf{y}}(n-1|n-1) \tag{7.8.26}
$$


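Thus we have a recursive formula to compute the predicted observation. The prediction
error (7.8.3) from (7.8.16) is now given by

$$
\begin{aligned}
\mathbf{w}(n) &= \mathbf{x}(n) - \hat{\mathbf{x}}(n|n-1) \\
&= \mathbf{H}(n)\mathbf{y}(n) + \mathbf{v}(n) - \mathbf{H}(n)\hat{\mathbf{y}}(n|n-1) \\
&= \mathbf{H}(n)\tilde{\mathbf{y}}(n|n-1) + \mathbf{v}(n)
\end{aligned} \tag{7.8.27}
$$

where we have defined the signal prediction error

$$
\tilde{\mathbf{y}}(n|n-1) \triangleq \mathbf{y}(n) - \hat{\mathbf{y}}(n|n-1) \tag{7.8.28}
$$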

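Now the quantity corresponding to $E\{w_m w_m^*\}$ in (7.8.5) is given by

$$
\mathbf{R}_w(n) = E\{\mathbf{w}(n)\mathbf{w}^H(n)\} = \mathbf{H}(n)\mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n) + \mathbf{R}_v(n) \tag{7.8.29}
$$

where

$$
\mathbf{R}_{\tilde{y}}(n|n-1) \triangleq E\{\tilde{\mathbf{y}}(n|n-1)\tilde{\mathbf{y}}^H(n|n-1)\} \tag{7.8.30}
$$

is called the prediction (a priori) error covariance matrix. Similarly, from (7.8.27) the
quantity corresponding to $E\{y w_m^*\}$ in (7.8.5) is given by

$$
\begin{aligned}
E\{\mathbf{y}(n)\mathbf{w}^H(n)\} &= E\{\mathbf{y}(n)[\tilde{\mathbf{y}}^H(n|n-1)\mathbf{H}^H(n) + \mathbf{v}^H(n)]\} \\
&= E\{[\hat{\mathbf{y}}(n|n-1) + \tilde{\mathbf{y}}(n|n-1)][\tilde{\mathbf{y}}^H(n|n-1)\mathbf{H}^H(n) + \mathbf{v}^H(n)]\} \\
&= E\{\tilde{\mathbf{y}}(n|n-1)\tilde{\mathbf{y}}^H(n|n-1)\}\mathbf{H}^H(n) \\
&= \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)
\end{aligned} \tag{7.8.31}
$$

since the optimal prediction error $\tilde{\mathbf{y}}(n|n-1)$ is orthogonal to the optimal prediction
$\hat{\mathbf{y}}(n|n-1)$. Now the updated MMSE estimate (which is also known as the filtered
estimate) corresponding to (7.8.5) is

$$
\begin{aligned}
\hat{\mathbf{y}}(n|n) &= \hat{\mathbf{y}}(n|n-1) + \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)\mathbf{R}_w^{-1}(n)\{\mathbf{x}(n) - \hat{\mathbf{x}}(n|n-1)\} \\
&= \hat{\mathbf{y}}(n|n-1) + \mathbf{K}(n)\{\mathbf{x}(n) - \mathbf{H}(n)\hat{\mathbf{y}}(n|n-1)\}
\end{aligned} \tag{7.8.32}
$$

where we have defined a new quantity

$$
\mathbf{K}(n) \triangleq \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)\mathbf{R}_w^{-1}(n) \tag{7.8.33}
$$

which is known as the Kalman gain matrix and where $\hat{\mathbf{y}}(n|n-1)$ is given in terms of
$\hat{\mathbf{y}}(n-1|n-1)$ using (7.8.25). Thus we have

$$
\begin{aligned}
\text{Prediction:} \quad & \hat{\mathbf{y}}(n|n-1) = \mathbf{A}(n-1)\hat{\mathbf{y}}(n-1|n-1) \\
\text{Filter:} \quad & \hat{\mathbf{y}}(n|n) = \hat{\mathbf{y}}(n|n-1) + \mathbf{K}(n)\{\mathbf{x}(n) - \mathbf{H}(n)\hat{\mathbf{y}}(n|n-1)\}
\end{aligned} \tag{7.8.34}
$$

and we have succeeded in obtaining a time-updating algorithm for recursively computing
the MMSE estimates. All that remains is a time evolution of the gain matrix $\mathbf{K}(n)$. Since
$\mathbf{R}_w(n)$ from (7.8.29) also depends on $\mathbf{R}_{\tilde{y}}(n|n-1)$, what we need is an update equation for
the error covariance matrix.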


Evolution of error covariance matrices 
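First we define the filtered error as

$$
\begin{aligned}
\tilde{\mathbf{y}}(n|n) &\triangleq \mathbf{y}(n) - \hat{\mathbf{y}}(n|n) \\
&= \mathbf{y}(n) - \hat{\mathbf{y}}(n|n-1) - \mathbf{K}(n)\{\mathbf{x}(n) - \mathbf{H}(n)\hat{\mathbf{y}}(n|n-1)\} \\
&= \tilde{\mathbf{y}}(n|n-1) - \mathbf{K}(n)\mathbf{w}(n)
\end{aligned} \tag{7.8.35}
$$

where we have used (7.8.27) and (7.8.34). Then the filtered error covariance is given by

$$
\begin{aligned}
\mathbf{R}_{\tilde{y}}(n|n) &\triangleq E\{\tilde{\mathbf{y}}(n|n)\tilde{\mathbf{y}}^H(n|n)\} \\
&= \mathbf{R}_{\tilde{y}}(n|n-1) - \mathbf{K}(n)\mathbf{R}_w(n)\mathbf{K}^H(n) \\
&= \mathbf{R}_{\tilde{y}}(n|n-1) - \mathbf{K}(n)\mathbf{R}_w(n)\mathbf{R}_w^{-1}(n)\mathbf{H}(n)\mathbf{R}_{\tilde{y}}(n|n-1) \\
&= [\mathbf{I} - \mathbf{K}(n)\mathbf{H}(n)]\mathbf{R}_{\tilde{y}}(n|n-1)
\end{aligned} \tag{7.8.36}
$$

where in the second-to-last step we substituted (7.8.33) for $\mathbf{K}^H(n)$. The error covariance
$\mathbf{R}_{\tilde{y}}(n|n)$ is also known as the a posteriori error covariance. Finally, we need to determine
the a priori prediction error covariance at time $n$ from $\mathbf{R}_{\tilde{y}}(n-1|n-1)$ to complete the
recursive calculations. From the prediction equation in (7.8.34), we obtain the prediction
error at time $n$ as

$$
\begin{aligned}
\mathbf{y}(n) - \hat{\mathbf{y}}(n|n-1) &= \mathbf{A}(n-1)\mathbf{y}(n-1) + \mathbf{B}(n)\boldsymbol{\eta}(n) - \mathbf{A}(n-1)\hat{\mathbf{y}}(n-1|n-1) \\
\tilde{\mathbf{y}}(n|n-1) &= \mathbf{A}(n-1)\tilde{\mathbf{y}}(n-1|n-1) + \mathbf{B}(n)\boldsymbol{\eta}(n)
\end{aligned} \tag{7.8.37}
$$

or

$$
\mathbf{R}_{\tilde{y}}(n|n-1) = \mathbf{A}(n-1)\mathbf{R}_{\tilde{y}}(n-1|n-1)\mathbf{A}^H(n-1) + \mathbf{B}(n)\mathbf{R}_\eta(n)\mathbf{B}^H(n) \tag{7.8.38}
$$

with initial condition $\mathbf{R}_{\tilde{y}}(-1|-1) = \mathbf{R}_y(-1)$. Thus we have

$$
\begin{aligned}
\text{A priori error covariance:} \quad & \mathbf{R}_{\tilde{y}}(n|n-1) = \mathbf{A}(n-1)\mathbf{R}_{\tilde{y}}(n-1|n-1)\mathbf{A}^H(n-1) + \mathbf{B}(n)\mathbf{R}_\eta(n)\mathbf{B}^H(n) \\
\text{Kalman gain:} \quad & \mathbf{K}(n) = \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)\mathbf{R}_w^{-1}(n) \\
\text{A posteriori error covariance:} \quad & \mathbf{R}_{\tilde{y}}(n|n) = [\mathbf{I} - \mathbf{K}(n)\mathbf{H}(n)]\mathbf{R}_{\tilde{y}}(n|n-1)
\end{aligned} \tag{7.8.39}
$$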
The complete Kalman filter algorithm is given in Table 7.5, and the block diagram descrip- 
tion is provided in Figure 7.11. 
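
To get a feel for the recursions in (7.8.39), it may help to write them out for a scalar time-invariant model. The following is only an illustrative special case, assuming $\mathbf{A}(n-1) = a$, $\mathbf{B}(n) = 1$, $\mathbf{H}(n) = 1$, $\mathbf{R}_\eta(n) = \sigma_\eta^2$, and $\mathbf{R}_v(n) = \sigma_v^2$; the symbol $p(\cdot|\cdot)$ for the scalar error variance is introduced here for convenience only.

```latex
\begin{aligned}
p(n|n-1) &= a^2\, p(n-1|n-1) + \sigma_\eta^2
  && \text{(a priori error variance)} \\
K(n) &= \frac{p(n|n-1)}{p(n|n-1) + \sigma_v^2}
  && \text{(Kalman gain, } 0 \le K(n) < 1\text{)} \\
p(n|n) &= [1 - K(n)]\, p(n|n-1)
        = \frac{\sigma_v^2\, p(n|n-1)}{p(n|n-1) + \sigma_v^2}
  && \text{(a posteriori error variance)}
\end{aligned}
```

In this special case the gain approaches 1 when the observation noise is small compared with the prediction error (the filter trusts the new observation) and approaches 0 in the opposite situation (the filter trusts the model prediction). None of the three quantities depends on the observed data.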


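TABLE 7.5
Summary of the Kalman filter algorithm.

1. Input:
(a) Signal model parameters: $\mathbf{A}(n-1)$, $\mathbf{B}(n)$, $\mathbf{R}_\eta(n)$; $n = 0, 1, 2, \ldots$
(b) Observation model parameters: $\mathbf{H}(n)$, $\mathbf{R}_v(n)$; $n = 0, 1, 2, \ldots$
(c) Observation data: $\mathbf{x}(n)$; $n = 0, 1, 2, \ldots$
2. Initialization: $\hat{\mathbf{y}}(-1|-1) = \mathbf{y}(-1) = \mathbf{0}$; $\mathbf{R}_{\tilde{y}}(-1|-1) = \mathbf{R}_y(-1)$
3. Time recursion: For $n = 0, 1, 2, \ldots$
(a) Signal prediction: $\hat{\mathbf{y}}(n|n-1) = \mathbf{A}(n-1)\hat{\mathbf{y}}(n-1|n-1)$
(b) Data prediction: $\hat{\mathbf{x}}(n|n-1) = \mathbf{H}(n)\hat{\mathbf{y}}(n|n-1)$
(c) A priori error covariance:
$\mathbf{R}_{\tilde{y}}(n|n-1) = \mathbf{A}(n-1)\mathbf{R}_{\tilde{y}}(n-1|n-1)\mathbf{A}^H(n-1) + \mathbf{B}(n)\mathbf{R}_\eta(n)\mathbf{B}^H(n)$
(d) Kalman gain:
$\mathbf{K}(n) = \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)\mathbf{R}_w^{-1}(n)$
$\mathbf{R}_w(n) = \mathbf{H}(n)\mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n) + \mathbf{R}_v(n)$
(e) Signal update: $\hat{\mathbf{y}}(n|n) = \hat{\mathbf{y}}(n|n-1) + \mathbf{K}(n)[\mathbf{x}(n) - \hat{\mathbf{x}}(n|n-1)]$
(f) A posteriori error covariance:
$\mathbf{R}_{\tilde{y}}(n|n) = [\mathbf{I} - \mathbf{K}(n)\mathbf{H}(n)]\mathbf{R}_{\tilde{y}}(n|n-1)$
4. Output: Filtered estimate $\hat{\mathbf{y}}(n|n)$, $n = 0, 1, 2, \ldots$


FIGURE 7.11
The block diagram of the Kalman filter model and algorithm. (Blocks: signal model, observation model, and discrete Kalman filter with prediction, correction, and update stages.)


EXAMPLE 7.8.2. Let $y(n)$ be an AR(2) process described by

$$
y(n) = 1.8y(n-1) - 0.81y(n-2) + 0.1\eta(n) \qquad n \ge 0 \tag{7.8.40}
$$

where $\eta(n) \sim \mathrm{WGN}(0, 1)$ and $y(-1) = y(-2) = 0$. We want to determine the linear MMSE
estimate of $y(n)$, $n \ge 0$, by observing

$$
x(n) = y(n) + \sqrt{10}\,v(n) \qquad n \ge 0 \tag{7.8.41}
$$

where $v(n) \sim \mathrm{WGN}(0, 10)$ and orthogonal to $\eta(n)$.

Solution. From (7.8.40) and (7.8.41), we first formulate the state vector and observation
equations:

$$
\mathbf{y}(n) \triangleq \begin{bmatrix} y(n) \\ y(n-1) \end{bmatrix}
= \begin{bmatrix} 1.8 & -0.81 \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} y(n-1) \\ y(n-2) \end{bmatrix}
+ \begin{bmatrix} 0.1 \\ 0 \end{bmatrix}\eta(n) \tag{7.8.42}
$$

and

$$
x(n) = \begin{bmatrix} 1 & 0 \end{bmatrix}\begin{bmatrix} y(n) \\ y(n-1) \end{bmatrix} + \sqrt{10}\,v(n) \tag{7.8.43}
$$

Hence the relevant matrix quantities are

$$
\mathbf{A}(n) = \begin{bmatrix} 1.8 & -0.81 \\ 1 & 0 \end{bmatrix} \qquad
\mathbf{B}(n) = \begin{bmatrix} 0.1 \\ 0 \end{bmatrix} \qquad
\mathbf{R}_\eta(n) = 1
$$

and

$$
\mathbf{H}(n) = \begin{bmatrix} 1 & 0 \end{bmatrix} \qquad \mathbf{R}_v(n) = 10 \tag{7.8.44}
$$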


Now the Kalman filter equations from Table 7.5 can be implemented with zero initial conditions.
Note that since the system matrices are constant, the processes $x(n)$ and $y(n)$ are asymptotically
stationary.

Using (7.8.40) and (7.8.41), we generated 100 samples of $y(n)$ and $x(n)$. The observation
$x(n)$ was processed using the Kalman filter equations to obtain the filtered estimate $\hat{y}(n|n)$, and the results
are shown in Figure 7.12. Owing to a large observation noise variance, the $x(n)$ values are very
noisy around the signal $y(n)$ values. However, the Kalman filter was able to track $y(n)$ closely
and reduce the noise $v(n)$ degradation. In Figure 7.13 we show the evolution of Kalman filter gain
values $K_1(n)$ and $K_2(n)$ along with the estimation error variance. The filter reaches its steady
state in about 20 samples and becomes a stationary filter as expected. In such situations, the
gain and error covariance equations can be implemented off-line (since these equations are
data-independent) to obtain a constant-gain matrix. The data then can be filtered using this constant
gain to reduce on-line computational complexity.
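
The steps of Table 7.5 can be sketched in Python for this example. The listing below is an illustrative sketch and not the code used to produce the figures: it assumes real-valued quantities and a scalar observation (so that $\mathbf{R}_w(n)$ is a scalar and no matrix inversion is needed), takes the observation-noise variance as $\mathbf{R}_v(n) = 10$ from (7.8.44), and uses an arbitrary random seed. The function name `kalman_scalar_obs` and all variable names are hypothetical.

```python
import random

def kalman_scalar_obs(x, A, b, var_eta, h, var_v):
    """Table 7.5 recursion for a real k-dimensional state and a scalar
    observation x(n) = h^T y(n) + v(n), with zero initial conditions.
    Returns one (estimate, gain, a priori cov, a posteriori cov) tuple per sample."""
    k = len(A)
    y_post = [0.0] * k                            # y_hat(-1|-1)
    R_post = [[0.0] * k for _ in range(k)]        # R(-1|-1)
    steps = []
    for xn in x:
        # (a) signal prediction and (b) data prediction
        y_pred = [sum(A[i][j] * y_post[j] for j in range(k)) for i in range(k)]
        x_pred = sum(h[j] * y_pred[j] for j in range(k))
        # (c) a priori covariance: A R A^T + b var_eta b^T
        AR = [[sum(A[i][m] * R_post[m][j] for m in range(k)) for j in range(k)]
              for i in range(k)]
        R_pred = [[sum(AR[i][m] * A[j][m] for m in range(k)) + var_eta * b[i] * b[j]
                   for j in range(k)] for i in range(k)]
        # (d) Kalman gain; the innovation variance r_w is a scalar here
        Rh = [sum(R_pred[i][j] * h[j] for j in range(k)) for i in range(k)]
        r_w = sum(h[i] * Rh[i] for i in range(k)) + var_v
        K = [g / r_w for g in Rh]
        # (e) signal update
        y_post = [y_pred[i] + K[i] * (xn - x_pred) for i in range(k)]
        # (f) a posteriori covariance: (I - K h^T) R_pred
        hR = [sum(h[i] * R_pred[i][j] for i in range(k)) for j in range(k)]
        R_post = [[R_pred[i][j] - K[i] * hR[j] for j in range(k)] for i in range(k)]
        steps.append((list(y_post), K, R_pred, R_post))
    return steps

# Matrix quantities of (7.8.44)
A = [[1.8, -0.81], [1.0, 0.0]]
b = [0.1, 0.0]
h = [1.0, 0.0]

# Simulate 100 samples; the seed is an arbitrary choice for repeatability.
random.seed(0)
y1 = y2 = 0.0                                     # y(n-1) and y(n-2)
y_true, x_obs = [], []
for n in range(100):
    yn = 1.8 * y1 - 0.81 * y2 + 0.1 * random.gauss(0.0, 1.0)
    y_true.append(yn)
    x_obs.append(yn + random.gauss(0.0, 10.0 ** 0.5))   # noise variance 10
    y1, y2 = yn, y1

steps = kalman_scalar_obs(x_obs, A, b, 1.0, h, 10.0)
y_filt = [s[0][0] for s in steps]                 # first state component estimates y(n)
gains = [s[1] for s in steps]                     # gain vector at each time n
```

Because the gain and covariance computations never touch `x_obs`, the list `gains` could be computed once off-line and its final value reused as a constant gain, as noted above.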


FIGURE 7.12
Estimation of AR(2) process using Kalman filter in Example 7.8.2. (Plot of amplitude over samples 0 to 100.)


FIGURE 7.13
Kalman filter gains and estimation error covariance in Example 7.8.2. (Panels: Kalman gain values; mean square error; samples 0 to 100.)


In the next example, we consider the case of the estimation of position of an object in
a linear motion subjected to random acceleration.


EXAMPLE 7.8.3. Consider an object traveling in a straight-line motion that is perturbed by
random acceleration. Let $y_p(n) = y_c(nT)$ be the true position of the object at the $n$th sampling
instant, where $T$ is the sampling interval in seconds and $y_c(t)$ is the instantaneous position.
This position is measured by a sensor that records noisy observations. Let $x(n)$ be the measured
position at the $n$th sampling instant. Then we can model the observation as

$$
x(n) = y_p(n) + v(n) \qquad n \ge 0 \tag{7.8.45}
$$

where $v(n) \sim \mathrm{WGN}(0, \sigma_v^2)$. To derive the state dynamic equation, we assume that the object is
in a steady-state motion (except for the random acceleration). Let $y_v(n) = \dot{y}_c(nT)$ be the true
velocity at the $n$th sampling instant, where $\dot{y}_c(t)$ is the instantaneous velocity. Then we have the
following equations of motion

$$
y_v(n) = y_v(n-1) + y_a(n-1)T \tag{7.8.46}
$$

$$
y_p(n) = y_p(n-1) + y_v(n-1)T + \tfrac{1}{2}y_a(n-1)T^2 \tag{7.8.47}
$$

where we have assumed that the acceleration $\ddot{y}_c(t)$ is constant over the sampling interval and
that $y_a(n-1)$ is the acceleration over $(n-1)T < t < nT$. We now define the state vector as

$$
\mathbf{y}(n) \triangleq \begin{bmatrix} y_p(n) \\ y_v(n) \end{bmatrix} \tag{7.8.48}
$$

and the modeling error as $\eta(n) \triangleq y_a(n-1)$, which we assume to be $\mathrm{WGN}(0, \sigma_\eta^2)$ and orthogonal to $v(n)$. Thus (7.8.46) and (7.8.47) can be arranged in vector form
as

$$
\mathbf{y}(n) = \begin{bmatrix} 1 & T \\ 0 & 1 \end{bmatrix}\mathbf{y}(n-1) + \begin{bmatrix} T^2/2 \\ T \end{bmatrix}\eta(n) \qquad n \ge 0 \tag{7.8.49}
$$

Thus we have

$$
\mathbf{A} = \begin{bmatrix} 1 & T \\ 0 & 1 \end{bmatrix} \quad \text{and} \quad \mathbf{B} = \begin{bmatrix} T^2/2 \\ T \end{bmatrix}
$$

Similarly, the observation (7.8.45) is given by

$$
x(n) = \begin{bmatrix} 1 & 0 \end{bmatrix}\mathbf{y}(n) + v(n) \qquad n \ge 0 \tag{7.8.50}
$$

and hence $\mathbf{H} = [1 \; 0]$. Let the initial conditions be $y_p(-1)$ and $y_v(-1)$. Now given the noisy
observations $\{x(n)\}$ and all the necessary information [$T$, $\sigma_v^2$, $\sigma_\eta^2$, $y_p(-1)$, and $y_v(-1)$], we
can recursively estimate the position and velocity of the object at each sampling instant. An
approach similar to this is used in aircraft navigation systems.

Using the following values

$$
T = 0.1 \qquad \sigma_v^2 = \sigma_\eta^2 = 0.25 \qquad y_p(-1) = 0 \qquad y_v(-1) = 1
$$


we simulated the trajectory of the object over the interval [0, 10] s. The Kalman
filter equations were obtained from Table 7.5, and the true positions as well as velocities were estimated using
the noisy positions. Figure 7.14 shows the estimation results. The top graph shows the true,
noisy, and estimated positions. The bottom graph shows the true and estimated velocities. Due to
random acceleration values (which are moderate), the true velocity has small deviations from the
constant value of 1 while the true position trajectory is approximately linear. The estimates of the
position follow the true values very closely. However, the velocity estimates have more errors
around the true velocities. This is because no direct measurements of velocities are available;
therefore, the velocity of the object can be inferred only from position measurements.


FIGURE 7.14
Estimation of positions and velocities using Kalman filter in Example 7.8.3. (Top panel: true, noisy, and estimated positions.)


In Figure 7.15, we show the trajectories of Kalman gain values and trace of the error
covariance matrices. The top graph contains the gain values corresponding to position ($K_p$)
and velocity ($K_v$). The steady state of the filter is reached in about 3 s. The bottom left graph
contains the a priori and a posteriori error covariances, which also reach the steady-state values 
in 3 s and which appear to be very close to each other. Therefore, in the bottom right graph we 
show an exploded view of the steady-state region over a 1-s interval. It is interesting to note 
that the steady-state error covariances before and after processing an observation are not the 
same. As a result of making an observation, the a posteriori errors are reduced from the a priori 
ones. However, owing to random acceleration, the errors increase during the intervals between 
observations. This is shown as dotted lines in Figure 7.15. The steady state is reached when the 
decrease in errors achieved by each observation is canceled by the increase between observations. 
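
The alternation between a priori and a posteriori error covariances can be reproduced without any data, because the covariance and gain recursions of Table 7.5 are data-independent. The following Python sketch iterates them for the values quoted in this example. It assumes $\mathbf{R}_{\tilde{y}}(-1|-1) = \mathbf{0}$ (exactly known initial conditions), an assumption made here because the example does not state the initial covariance; all variable names are illustrative.

```python
T, var_v, var_eta = 0.1, 0.25, 0.25            # values quoted in Example 7.8.3
A = [[1.0, T], [0.0, 1.0]]
b = [T * T / 2.0, T]                           # input vector B
R_post = [[0.0, 0.0], [0.0, 0.0]]              # assumed R(-1|-1) = 0

history = []                                   # (trace a priori, trace a posteriori, K_p, K_v)
for n in range(100):                           # 100 samples cover the 10 s interval
    # A priori covariance: A R A^T + B var_eta B^T
    AR = [[sum(A[i][k] * R_post[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    R_pred = [[sum(AR[i][k] * A[j][k] for k in range(2)) + var_eta * b[i] * b[j]
               for j in range(2)] for i in range(2)]
    # Gain for H = [1 0]: first column of R_pred over the innovation variance
    r_w = R_pred[0][0] + var_v
    K = [R_pred[0][0] / r_w, R_pred[1][0] / r_w]
    # A posteriori covariance: (I - K H) R_pred
    R_post = [[R_pred[i][j] - K[i] * R_pred[0][j] for j in range(2)] for i in range(2)]
    history.append((R_pred[0][0] + R_pred[1][1],
                    R_post[0][0] + R_post[1][1], K[0], K[1]))
```

At every step the a posteriori trace is smaller than the a priori trace, and the next prediction raises it again; in the steady state the two changes balance, which is the behavior described above.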


FIGURE 7.15
Kalman filter gains and estimation error variances in Example 7.8.3. (Top panel: Kalman gain components.)


It should be clear from the above two examples that the Kalman filter can recursively
estimate signal values because of the assumption of dynamic models (7.8.14) and (7.8.16).
Therefore, in this sense, the Kalman filter approach is a special case of the more general
Wiener filter problem that we considered earlier. In many signal processing applications
(e.g., data communication systems), assumption of such models is difficult to justify, which
limits the use of Kalman filters.


7.9 SUMMARY 


The application of optimum FIR filters and linear combiners involves the following two
steps.


• Design. In this step, we determine the optimum values of the estimator parameters by
solving the normal equations formed by using the known second-order moments. For
stationary processes the design step is done only once. For nonstationary processes, we
repeat the design when the statistics change.
• Implementation. In this step, we use the optimum parameters and the input data to compute
the optimum estimate.


The type and complexity of the algorithms and structures available for the design and 
implementation of linear MMSE estimators depend on two factors: 


• The shift invariance of the input data vector.
• The stationarity of the signals that determine the second-order moments in the normal
equations.


As we introduce more structure (shift invariance or stationarity), the algorithms and 
structures become simpler. From a mathematical point of view, this is reflected in the struc- 
ture of the correlation matrix, which starting from general Hermitian at one end becomes 
Toeplitz at the other. 


Linear combiners 


The input vector is not shift-invariant because the optimum estimate is computed by
using samples from M different signals. The correlation matrix R is Hermitian and usually
positive definite. The normal equations are solved by using the $\mathbf{LDL}^H$ decomposition,
and the optimum estimate is computed by using the obtained parameters. However, in
many applications where we need the optimum estimate and not the coefficients of the
optimum combiner, we can implement the MMSE linear combiner, using the orthogonal
order-recursive structure shown in Figure 7.1. This structure consists of two parts: (1) a
triangular decorrelator (orthogonalizer) that decorrelates the input data vector and produces
its innovations vector and (2) a linear combiner that combines the uncorrelated innovations
to compute the optimum estimates for all orders $1 \le m \le M$.


FIR filters and predictors 


In this case the input data vector is shift-invariant, which leads to simplifications, whose 
extent depends on the stationarity of the involved signals. 


Nonstationary case. In general, the correlation matrix is Hermitian and positive definite
with no additional structure, and the $\mathbf{LDL}^H$ decomposition is the recommended method
to solve the normal equations. However, the input shift invariance leads to a remarkable
coupling between FLP, BLP, and FIR filtering, resulting in a simplified orthogonal
order-recursive structure, which now takes the form of a lattice-ladder filter (see Figure 7.3). The
backward prediction errors of all orders $1 \le m \le M$ provide the innovations of the input
data vector. The parameters of lattice structure (decorrelator) are specified by the components
of the $\mathbf{LDL}^H$ decomposition of the input correlation matrix. The coefficients of the
ladder part (correlator) depend on both the input correlation matrix and the cross-correlation
between the desired response and the input data vector.


Stationary case. In this case, the addition of stationarity to the shift invariance makes 
the correlation matrix Toeplitz. The presence of the Toeplitz structure has the following 
consequences: 


1. The development of efficient order-recursive algorithms, with computational complexity
proportional to $M^2$, for the solution of the normal equations and the triangularization of
the correlation matrix.


a. Levinson algorithm solves $\mathbf{R}\mathbf{c} = \mathbf{d}$ for arbitrary right-hand side vector $\mathbf{d}$ ($2M^2$ operations).

b. Levinson-Durbin algorithm solves $\mathbf{R}\mathbf{a} = -\mathbf{r}^*$ when the right-hand side has special
structure ($M^2$ operations).

c. Schür algorithm computes directly the lattice-ladder parameters from the autocorrelation
and cross-correlation sequences.


2. The MMSE FLP, BLP, and FIR filters are time-invariant; that is, their coefficients (direct-form
or lattice-ladder structures) are constant and should be computed only once.


The algorithms for MMSE filtering and prediction of stationary processes are the sim- 
plest ones. However, we can also develop efficient algorithms for nonstationary processes 
that have special structure. There are two cases of interest: 


• The Kalman filtering algorithm that can be used for processes generated by a state-space
model with known parameters.

• Algorithms for α-stationary processes, that is, processes whose correlation matrix is near
to Toeplitz, as measured by a special distance known as the displacement rank (Morf et
al. 1977).


PROBLEMS 


7.1 By first computing the matrix product 


and then the determinants of both sides, prove Equation (7.1.25). Another proof, obtained using
the $\mathbf{LDL}^H$ decomposition, is given by Equation (7.2.4).


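7.2 Prove the matrix inversion lemma for lower right corner partitioned matrices, which is described
by Equations (7.1.26) and (7.1.28).

7.3 This problem generalizes the matrix inversion lemmas to nonsymmetric matrices.

(a) Show that if $\mathbf{R}^{-1}$ exists, the inverse of an upper left corner partitioned matrix is given by

$$
\begin{bmatrix} \mathbf{R} & \mathbf{r} \\ \tilde{\mathbf{r}}^T & \rho \end{bmatrix}^{-1}
= \frac{1}{\alpha}\begin{bmatrix} \alpha\mathbf{R}^{-1} + \mathbf{w}\mathbf{v}^T & \mathbf{w} \\ \mathbf{v}^T & 1 \end{bmatrix}
$$

where

$$
\mathbf{R}\mathbf{w} \triangleq -\mathbf{r} \qquad
\mathbf{R}^T\mathbf{v} \triangleq -\tilde{\mathbf{r}} \qquad
\alpha \triangleq \rho - \tilde{\mathbf{r}}^T\mathbf{R}^{-1}\mathbf{r} = \rho + \mathbf{v}^T\mathbf{r} = \rho + \tilde{\mathbf{r}}^T\mathbf{w}
$$

(b) Show that if $\mathbf{R}^{-1}$ exists, the inverse of a lower right corner partitioned matrix is given by

$$
\begin{bmatrix} \rho & \tilde{\mathbf{r}}^T \\ \mathbf{r} & \mathbf{R} \end{bmatrix}^{-1}
= \frac{1}{\alpha}\begin{bmatrix} 1 & \mathbf{v}^T \\ \mathbf{w} & \alpha\mathbf{R}^{-1} + \mathbf{w}\mathbf{v}^T \end{bmatrix}
$$

where

$$
\mathbf{R}\mathbf{w} \triangleq -\mathbf{r} \qquad
\mathbf{R}^T\mathbf{v} \triangleq -\tilde{\mathbf{r}} \qquad
\alpha \triangleq \rho - \tilde{\mathbf{r}}^T\mathbf{R}^{-1}\mathbf{r} = \rho + \mathbf{v}^T\mathbf{r} = \rho + \tilde{\mathbf{r}}^T\mathbf{w}
$$

(c) Check the validity of the lemmas in parts (a) and (b), using MATLAB.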


7.4 Develop an order-recursive algorithm to solve the linear system in Example 7.1.2, using the 
lower right corner partitioning lemma (7.1.26). 


7.5 In this problem we consider two different approaches for inversion of symmetric and positive 
definite matrices by constructing an arbitrary fourth-order positive definite correlation matrix 
R and comparing their computational complexities. 


(a) Given that the inverse of a lower (upper) triangular matrix is itself lower (upper) triangular, 
develop an algorithm for triangular matrix inversion. 
(b) Compute the inverse of R, using the algorithm in part (a) and Equation (7.1.58). 
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7.6 


7.7 


78 


7.9 


7.10 


711 


7.12 


7.13 


7.14 


7.15 


7.16 


7.17 


7.18 


7.19 


(c) Build up the inverse of R, using the recursion (7.1.24). 
(d) Estimate the number of operations for each method as a function of order M, and check 
their validity for M = 4, using MATLAB. 


Using the appropriate orthogonality principles and definitions, prove Equation (7.3.32). 
Prove Equations (7.3.36) to (7.3.38), using Equation (7.1.45). 


Working as in Example 6.3.1, develop an algorithm for the upper-lower decomposition of a 
symmetric positive definite matrix. Then use it to factorize the matrix in Example 6.3.1, and 
verify your results, using the function [U,D]=udut (R). 


In this problem we explore the meaning of the various quantities in the decomposition R = upu” 
of the correlation matrix. 


(a) Show that the rows of A = U~! are the MMSE estimator of xm from Xm+1>Xm+2)++++XM- 


(b) Show that the decomposition R = ubu” can be obtained by the Gram-Schmidt orthog- 
onalization process, starting with the random variable xj, and ending with x1, that is, 
proceeding backward. 


In this problem we clarify the various quantities and the form of the partitionings involved in 
the UDU? decomposition, using an m = 4 correlation matrix. 


(a) Prove that the components of the forward prediction error vector (7.3.65) are uncorrelated. 

(b) Writing explicitly the matrix R, identify and express the quantities in Equations (7.3.62) 
through (7.3.67). 

(c) Using the matrix R in Example 6.3.2, compute the predictors in (7.3.67) by using the 
corresponding normal equations, verify your results, comparing them with the rows of matrix 
A computed directly from the LDL” decomposition of R~! or the UDU” decomposition 
of R (see Table 7.1). 


Given an all-zero lattice filter with coefficients kg and k,, determine the MSE P(kg, kj) as a 
function of the required second-order moments, assumed jointly stationary, and plot the error 


performance surface. Use the statistics in Example 6.2.1. 


Given the autocorrelation r(0) = 1,r(_1) = r(2) = 5, and r(3) = i determine all possible 


representations for the third-order prediction error filter (see Figure 7.7). 
Repeat Problem 7.12 for kg = ky = kp = 7 and P3 = (3). 


Use Levinson’s algorithm to solve the normal equations Re = d where R = Toeplitz{3, 2, 1} 
andd = [662]. 


Consider a random sequence with autocorrelation $\{r(l)\}_0^3 = \{1, 0.8, 0.6, 0.4\}$. (a) Determine
the FLP $\mathbf{a}_m$ and the corresponding error $P_m^f$ for m = 1, 2, 3. (b) Determine and draw the flow
diagram of the third-order lattice prediction error filter.


Using the Levinson-Durbin algorithm, determine the third-order linear predictor a3 and the 
MMSE P} for a signal with autocorrelation r(0) = 1, r(1) = r(2) = 4, and r(3) = }. 


Given the autocorrelation sequence r(0) = 1,r(1) = r(2) = 5, and r(3) = i, compute the 
lattice and direct-form coefficients of the prediction error filter, using the algorithm of Schür.


Determine $\rho_1$ and $\rho_2$ so that the matrix $\mathbf{R} = \mathrm{Toeplitz}\{1, \rho_1, \rho_2\}$ is positive definite.


Suppose that we want to fit an AR(2) model to a sinusoidal signal with random phase in additive
noise. The autocorrelation sequence is given by

$$r(l) = P_0 \cos \omega_0 l + \sigma_v^2\,\delta(l)$$

(a) Determine the model parameters $a_1^{(2)}$, $a_2^{(2)}$, and $\sigma_w^2$ in terms of $P_0$, $\omega_0$, and $\sigma_v^2$. (b) Determine
the lattice parameters of the model. (c) What are the limiting values of the direct and lattice
parameters of the model when $\sigma_v^2 \to 0$?


Given the parameters r(0) = 1,k9 = ky = 5, and kz = 
representations of the prediction error filter (see Figure 7.7). 


i determine all other equivalent 


Let $\{r(l)\}_0^P$ be samples of the autocorrelation sequence of a stationary random signal x(n).
(a) Is it possible to extend r(l) for |l| > P so that the PSD

$$R(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r(l)\,e^{-j\omega l}$$

is valid, that is, $R(e^{j\omega}) \ge 0$? (b) Using the algorithm of Levinson-Durbin, develop a procedure
to check if a given autocorrelation extension is valid. (c) Use the algorithm in part (b) to find
the necessary and sufficient conditions so that r(0) = 1, $r(1) = \rho_1$, and $r(2) = \rho_2$ are a valid
autocorrelation sequence. Is the resulting extension unique?


Justify the following statements. (a) The whitening filter for a stationary process x(n) is time- 
varying. (b) The filter in part (a) can be implemented by using a lattice structure and switching 
its stages on one by one with the arrival of each new sample. (c) If x(n) is AR(P), the whitening
filter becomes time-invariant P + 1 sampling intervals after the first sample is applied. Note: 
We assume that the input is applied to the filter at n = 0. If the input is applied at n = —oo, the 
whitening filter of a stationary process is always time-invariant. 


Given the parameters r(0) = 1, kg = 7 k= 
matrix Ry = Toeplitz{r(0), (1), r(2), r(3)}. 


i , compute the determinant of the 


(a) Determine the lattice second-order prediction error filter (PEF) for a sequence x(n) with 
autocorrelation r(/) = Ou I. (b) Repeat part (a) for the sequence y(n) = x(n) + v(m), where 
v(n) ~ WN(0, 0.2) is uncorrelated to x(n). (c) Explain the change in the lattice parameters 
using frequency domain reasoning (think of the PEF as a whitening filter). 

Consider a prediction error filter specified by P3 = iy, ko = i kj = 5, and ky = i 
(a) Determine the direct-form filter coefficients. (b) Determine the autocorrelation values r(1), 
r(2), and r(3). (c) Determine the value r (4) so that the MMSE P, for the corresponding fourth- 
order filter is the minimum possible. 


Consider a prediction error filter Ayg(z) = 1+ ge tere t+ a ees with lattice para- 
meters kj, ko,..., ky. (a) Show that if we set km = (—1)"km, then a” = (-1y"a\. 


(b) What are the new filter coefficients if we set kin = p'km, where p is a complex number 
with |o| = 1? What happens if |p| < 1? 


Suppose that we are given the values {r(/ sens i of an autocorrelation sequence such that the 
Toeplitz matrix Rj, is positive definite. (a) Show that the values of r(m) such that R,,+1 is 
positive definite determine a disk in the complex plane. Find the center a and the radius €,, 


m-1 
, that 


of this disk. (b) By induction show that there are infinitely many extensions of {r(/)}"), Hy 


make {r(/)}°°,, a valid autocorrelation sequence. 
Consider the MA(1) sequence x(n) = w(n) +d, w(n — 1), w(n) ~ WN(O, ae) (a) Show that 
det Ry, = r(0) det R,,_1) — Ir(1)|7Ryy—2 m>2 
(b) Show that ky», = —r’™ (1)/ det Ry», and that 
1 r(0) 1 r*(_1) 1 


km ~ rl) km-1 r(1) km—2


(c) Determine the initial conditions and solve the recursion in (b) to show that 


_ d= ld?) 
en ee ae 


which tends to zero as m > oo. 
Prove Equation (7.6.6) by exploiting the symmetry property bm = Ja%,. 


In this problem we show that the lattice parameters can be obtained by “feeding” the autocor- 
relation sequence through the lattice filter as a signal and switching on the stages one by one 
after the required lattice coefficient is computed. The value of kj, is computed at time n = m 
from the inputs to stage m. (a) Using (7.6.10), draw the flow diagram of a third-order lattice 
filter that implements this algorithm. (b) Using the autocorrelation sequence in Example 7.6.1, 
“feed” the sequence {r(n)}B = (3,2, 1, 5} through the filter one sample at a time, and compute 
the lattice parameters. Hint: Use Example 7.6.1 for guidance. 


Draw the superlattice structure for M = 8, and show how it can be partitioned to distribute
the computations to three processors for parallel execution. 


Derive the superladder structure shown in Figure 7.10. 


Extend the algorithm of Schür to compute the $\mathbf{L}\mathbf{D}\mathbf{L}^H$ decomposition of a Hermitian Toeplitz
matrix, and write a MATLAB function for its implementation. 


Given the matrix R3 = Toeplitz{1, s +), use the appropriate order-recursive algorithms to 


compute the following: (a) The LDL? and UDU# decompositions of R, (b) the LDL? and 
UDU# decompositions of R7! and (c) the inverse matrix R7!. 


Consider the AR(1) process $x(n) = \rho\,x(n-1) + w(n)$, where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$ and
$-1 < \rho < 1$. (a) Determine the correlation matrix $\mathbf{R}_{M+1}$ of the process. (b) Determine the Mth-order
FLP, using the algorithm of Levinson-Durbin. (c) Determine the inverse matrix $\mathbf{R}_{M+1}^{-1}$, using
the triangular decomposition discussed in Section 7.7.


If $r(l) = \cos \omega_0 l$, determine the second-order prediction error filter and check whether it is
minimum-phase. 


Show that the MMSE linear predictor of x(n + D) in terms of x(n), x(n-1), ..., x(n-M+1)
for $D \ge 1$ is given by

$$\mathbf{R}\,\mathbf{a}^{(D)} = -\mathbf{r}^{(D)}$$

where $\mathbf{r}^{(D)} = [r(D)\ r(D+1)\ \cdots\ r(D+M-1)]^T$. Develop a recursion that computes $\mathbf{a}^{(D+1)}$
from $\mathbf{a}^{(D)}$ by exploiting the shift invariance of the vector $\mathbf{r}^{(D)}$. See Manolakis et al. (1983).


The normal equations for the optimum symmetric signal smoother (see Section 6.5.1) can be 
written as 


0 
Rom4102m41 = | Pam4i 
0 
G _ * (2m+1) _ ‘ “ % 
where P24 1 is the MMSE, cam+41 = Jn and Cy = 1. (a) Using a “central 


partitioning of R»,,43 and the persymmetry property of Toeplitz matrices, develop a recursion 
to determine ¢2,,43 from ¢2,,41. (b) Develop a complete order-recursive algorithm for the 
computation of {€2741, Pam+1 yy (see Kok et al. 1993). 


Using the triangular decomposition of a Toeplitz correlation matrix, show that (a) the forward
prediction errors of various orders and at the same time instant, that is,

$$\mathbf{e}^f(n) = [e_0^f(n)\ e_1^f(n)\ \cdots\ e_M^f(n)]^T$$

are correlated and (b) the forward prediction errors

$$\bar{\mathbf{e}}^f(n) = [e_M^f(n)\ e_{M-1}^f(n-1)\ \cdots\ e_0^f(n-M)]^T$$

are uncorrelated.


Generalize the inversion algorithm described in Section 7.7.3 to handle Hermitian Toeplitz 
matrices. 


Consider the estimation of a constant a from its noisy observations. The signal and observation
models are given by

$$y(n+1) = y(n) \qquad n \ge 0 \qquad y(0) = a$$
$$x(n) = y(n) + v(n) \qquad v(n) \sim \mathrm{WGN}(0, \sigma_v^2)$$

(a) Develop scalar Kalman filter equations, assuming the initial condition on the a posteriori
error variance $R_{\tilde{y}}(0|0)$ equal to $r_0$.
(b) Show that the a posteriori error variance $R_{\tilde{y}}(n|n)$ is given by

$$R_{\tilde{y}}(n|n) = \frac{r_0}{1 + (r_0/\sigma_v^2)\,n} \tag{P.1}$$

(c) Show that the optimal filter for the estimation of the constant a is given by

$$\hat{y}(n) = \hat{y}(n-1) + \frac{r_0/\sigma_v^2}{1 + (r_0/\sigma_v^2)\,n}\,[x(n) - \hat{y}(n-1)]$$


Consider a random process with PSD given by

$$R_s(e^{j\omega}) = \frac{4}{2.4661 - 1.629\cos\omega + 0.81\cos 2\omega}$$

(a) Using MATLAB, plot the PSD $R_s(e^{j\omega})$ and determine the resonant frequency $\omega_0$.
(b) Using spectral factorization, develop a signal model for the process of the form

$$\mathbf{y}(n) = \mathbf{A}\,\mathbf{y}(n-1) + \mathbf{B}\,\eta(n)$$
$$s(n) = [1\ \ 0]\,\mathbf{y}(n)$$

where $\mathbf{y}(n)$ is a 2 x 1 vector, $\eta(n) \sim \mathrm{WGN}(0, 1)$, and A and B are matrices with appropriate
dimensions.
(c) Let x(n) be the observed values of s(n) given by

$$x(n) = s(n) + v(n) \qquad v(n) \sim \mathrm{WGN}(0, 1)$$

Assuming reasonable initial conditions, develop Kalman filter equations and implement
them, using MATLAB. Study the performance of the filter by simulating a few sample functions
of the signal process s(n) and its observation x(n).


7.43 Alternative form of the Kalman filter. A number of different identities and expressions can be
obtained for the quantities defining the Kalman filter.
(a) By manipulating the last two equations in (7.8.39) show that

$$\mathbf{R}_{\tilde{y}}(n|n) = \mathbf{R}_{\tilde{y}}(n|n-1) - \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)\,[\mathbf{H}(n)\mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n) + \mathbf{R}_v(n)]^{-1}\,\mathbf{H}(n)\mathbf{R}_{\tilde{y}}(n|n-1) \tag{P.2}$$

(b) If the inverses of $\mathbf{R}_{\tilde{y}}(n|n)$, $\mathbf{R}_{\tilde{y}}(n|n-1)$, and $\mathbf{R}_v$ exist, then show that

$$\mathbf{R}_{\tilde{y}}^{-1}(n|n) = \mathbf{R}_{\tilde{y}}^{-1}(n|n-1) + \mathbf{H}^H(n)\,\mathbf{R}_v^{-1}(n)\,\mathbf{H}(n) \tag{P.3}$$

This shows that the update of the error covariance matrix does not require the Kalman gain
matrix (but does require matrix inverses).
(c) Finally show that the gain matrix is given by

$$\mathbf{K}(n) = \mathbf{R}_{\tilde{y}}(n|n)\,\mathbf{H}^H(n)\,\mathbf{R}_v^{-1}(n) \tag{P.4}$$

which is computed by using the a posteriori error covariance matrix.


In Example 7.8.3 we assumed that only the position measurements were available for esti- 
mation. In this problem we will assume that we also have a noisy sensor to measure velocity 
measurements. Hence the observation model is 


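$$\mathbf{x}(n) = \begin{bmatrix} x_p(n) \\ x_v(n) \end{bmatrix} = \begin{bmatrix} y_p(n) + v_1(n) \\ y_v(n) + v_2(n) \end{bmatrix} \tag{P.5}$$

where $v_1(n)$ and $v_2(n)$ are two independent zero-mean white Gaussian noise sources with
variances $\sigma_{v_1}^2$ and $\sigma_{v_2}^2$, respectively.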


(a) Using the state vector model given in Example 7.8.3 and the observation model in (P.5), 
develop Kalman filter equations to estimate position and velocity of the object at each n. 
(b) Using the parameter values 


$$T = 0.1 \qquad \sigma_{v_1}^2 = \sigma_{v_2}^2 = \sigma_\eta^2 = 0.25 \qquad y_p(-1) = 0 \qquad y_v(-1) = 1$$


simulate the true and observed positions and velocities of the object. Using your Kalman 
filter equations, generate plots similar to the ones given in Figures 7.14 and 7.15. 
(c) Discuss the effects of velocity measurements on the estimates. 


In this problem, we will assume that the acceleration $y_a(n)$ is an AR(1) process rather than a
white noise process. Let $y_a(n)$ be given by

$$y_a(n) = \alpha\,y_a(n-1) + \eta(n) \qquad \eta(n) \sim \mathrm{WGN}(0, \sigma_\eta^2) \qquad y_a(-1) = 0 \tag{P.6}$$

(a) Augment the state vector $\mathbf{y}(n)$ in (7.8.48), using variable $y_a(n)$, and develop the state vector
as well as the observation model, assuming that only the position is measured.
(b) Using the above model and the parameter values 


FeOl o209 ° of =67=025 
y(-D=0 ywR-D=!1 — ya(-)=0 
simulate the linear motion of the object. Using Kalman filter equations, estimate the position, 
velocity, and acceleration values of the object at each n. Generate performance plots similar 
to the ones given in Figures 7.14 and 7.15. 


(c) Now assume that noisy measurements of $y_v(n)$ and $y_a(n)$ are also available, that is, the
observation model is

$$\mathbf{x}(n) = \begin{bmatrix} x_p(n) \\ x_v(n) \\ x_a(n) \end{bmatrix} = \begin{bmatrix} y_p(n) + v_1(n) \\ y_v(n) + v_2(n) \\ y_a(n) + v_3(n) \end{bmatrix} \tag{P.7}$$

where $v_1(n)$, $v_2(n)$, and $v_3(n)$ are IID zero-mean white Gaussian noise sources with variance
$\sigma_v^2$. Repeat parts (a) and (b) above.


CHAPTER 8 


Least-Squares Filtering and Prediction 


In this chapter, we deal with the design and properties of linear combiners, finite impulse 
response (FIR) filters, and linear predictors that are optimum in the least-squares error 
(LSE) sense. The principle of least squares is widely used in practice because second-order 
moments are rarely known. In the first part of this chapter (Sections 8.1 through 8.4), we 
concentrate on the design, properties, and applications of least-squares (LS) estimators.
Section 8.1 discusses the principle of LS estimation. The unique aspects of the different 
implementation structures, starting with the general linear combiner followed by the FIR 
filter and predictor, are treated in Sections 8.2 to 8.4. In the second part (Sections 8.5 to 
8.7), we discuss various numerical algorithms for the solution of the LSE normal equations 
and the computation of LSE estimates including QR decomposition techniques (House- 
holder reflections, Givens rotations, and modified Gram-Schmidt orthogonalization) and 
the singular value decomposition (SVD). 


8.1 THE PRINCIPLE OF LEAST SQUARES 


The principle of least squares was introduced by the German mathematician Carl Friedrich 
Gauss, who used it to determine the orbit of the asteroid Ceres in 1801 by formulating the
estimation problem as an optimization problem. 

The design of optimum filters in the minimum mean square error (MMSE) sense, 
discussed in Chapter 6, requires the a priori knowledge of second-order moments. However, 
such statistical information is simply not available in most practical applications, for which 
we can only obtain measurements of the input and desired response signals. To avoid this 
problem, we can (1) estimate the required second-order moments from the available data 
(see Chapter 5), if possible, to obtain an estimate of the optimum MMSE filter, or (2) design 
an optimum filter by minimizing a criterion of performance that is a function of the available 
data. 

In this chapter, we use the minimization of the sum of the squares of the estimation error 
as the criterion of performance for the design of optimum filters. This method, known as 
least-squares error (LSE) estimation, requires the measurement of both the input signal and
the desired response signal. A natural question arising at this point is, What is the purpose 
of estimating the values of a known, desired response signal? There are several answers:


A note about abbreviations used throughout the chapter: The two acronyms LSE and LS will be used almost
interchangeably. Although LSE is probably the more accurate term, LS has become a standard reference to LSE
estimators.

1. In system modeling applications, the goal is to obtain a mathematical model describing
the input-output behavior of an actual system. A quality estimator provides a good model 
for the system. The desired result is the estimator or system model, not the actual estimate. 

2. In linear predictive coding, the useful result is the prediction error or the respective 
predictor coefficients. 

3. In many applications, the desired response is not available (e.g., digital communications).
Therefore, we do not always have a complete set of data from which to design the LSE 
estimator. However, if the data do not change significantly over a number of sets, then 
one special complete set, the training set, is used to design the estimator. The resulting 
estimator is then applied to the processing of the remaining incomplete sets. 


The use of measured signal values to determine the coefficients of the estimator leads to 
some fundamental differences between MMSE and LSE estimation that are discussed where 
appropriate. 

To summarize, depending on the available information, there are two ways to design 
an optimum estimator: (1) If we know the second-order moments, we use the MMSE 
criterion and design a filter that is optimum for all possible sets of data with the same 
statistics. (2) If we only have a block of data, we use the LSE criterion to design an estimator 
that is optimum for the given block of data. Optimum MMSE estimators are obtained by 
using ensemble averages, whereas LSE estimators are obtained by using finite-length time 
averages. For example, an MMSE estimator, designed using ensemble averages, is optimum 
for all realizations. In contrast, an LSE estimator, designed using a block of data from a 
particular realization, depends on the numerical values of samples used in the design. If 
the processes are ergodic, the LSE estimator approaches the MMSE estimator as the block 
length of the data increases toward infinity. 


8.2 LINEAR LEAST-SQUARES ERROR ESTIMATION 


We start with the derivation of general linear LS filters that are implemented using the linear 
combiner structure described in Section 6.2. A set of measurements of the desired response 
y(n) and the input signals $x_k(n)$ for $1 \le k \le M$ has been taken for $0 \le n \le N-1$. As in
optimum MMSE estimation, the problem is to estimate the desired response y(n) using the
linear combination

$$\hat{y}(n) = \sum_{k=1}^{M} c_k^*(n)\,x_k(n) = \mathbf{c}^H(n)\,\mathbf{x}(n) \tag{8.2.1}$$

We define the estimation error as

$$e(n) = y(n) - \hat{y}(n) = y(n) - \mathbf{c}^H(n)\,\mathbf{x}(n) \tag{8.2.2}$$


and the coefficients of the combiner are determined by minimizing the sum of the squared
errors

$$E = \sum_{n=0}^{N-1} |e(n)|^2 \tag{8.2.3}$$

that is, the energy of the error signal. For this minimization to be possible, the coefficient
vector $\mathbf{c}(n)$ should be held constant over the measurement time interval $0 \le n \le N-1$.
The constant vector $\mathbf{c}_{ls}$ resulting from this optimization depends on the measurement set
and is known as the linear LSE estimator. In the statistical literature, LSE estimation is
known as linear regression, where (8.2.2) is called a regression function, e(n) are known
as residuals (leftovers), and $\mathbf{c}(n)$ is the regression vector (Montgomery and Peck 1982).


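The system of equations in (8.2.2), or equivalently $e^*(n) = y^*(n) - \mathbf{x}^H(n)\,\mathbf{c}$, can be
written in matrix form as

$$\begin{bmatrix} e^*(0) \\ e^*(1) \\ \vdots \\ e^*(N-1) \end{bmatrix} = \begin{bmatrix} y^*(0) \\ y^*(1) \\ \vdots \\ y^*(N-1) \end{bmatrix} - \begin{bmatrix} x_1^*(0) & x_2^*(0) & \cdots & x_M^*(0) \\ x_1^*(1) & x_2^*(1) & \cdots & x_M^*(1) \\ \vdots & \vdots & & \vdots \\ x_1^*(N-1) & x_2^*(N-1) & \cdots & x_M^*(N-1) \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_M \end{bmatrix} \tag{8.2.4}$$

or more compactly as

$$\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{c} \tag{8.2.5}$$

where

$$\begin{aligned} \mathbf{e} &\triangleq [e(0)\ e(1)\ \cdots\ e(N-1)]^H && \text{error data vector } (N \times 1) \\ \mathbf{y} &\triangleq [y(0)\ y(1)\ \cdots\ y(N-1)]^H && \text{desired response vector } (N \times 1) \\ \mathbf{X} &\triangleq [\mathbf{x}(0)\ \mathbf{x}(1)\ \cdots\ \mathbf{x}(N-1)]^H && \text{input data matrix } (N \times M) \\ \mathbf{c} &\triangleq [c_1\ c_2\ \cdots\ c_M]^T && \text{combiner parameter vector } (M \times 1) \end{aligned} \tag{8.2.6}$$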


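are defined by comparing (8.2.4) to (8.2.5). The input data matrix X can be partitioned
either columnwise or rowwise as follows:

$$\mathbf{X} \triangleq [\tilde{\mathbf{x}}_1\ \tilde{\mathbf{x}}_2\ \cdots\ \tilde{\mathbf{x}}_M] = \begin{bmatrix} \mathbf{x}^H(0) \\ \mathbf{x}^H(1) \\ \vdots \\ \mathbf{x}^H(N-1) \end{bmatrix} \tag{8.2.7}$$

where the columns $\tilde{\mathbf{x}}_k$ of X

$$\tilde{\mathbf{x}}_k \triangleq [x_k(0)\ x_k(1)\ \cdots\ x_k(N-1)]^H$$

will be called data records and the rows

$$\mathbf{x}(n) \triangleq [x_1(n)\ x_2(n)\ \cdots\ x_M(n)]^T$$
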

will be called snapshots. Both of these partitionings of the data matrix, which are illustrated 
in Figure 8.1, are useful in the derivation, interpretation, and computation of LSE estimators. 

The LSE estimator operates in a block processing mode; that is, it processes a frame of 
N snapshots using the steps shown in Figure 8.2. The input signals are blocked into frames 
of N snapshots with successive frames overlapping by No samples. The values of N and 
No depend on the application. The required estimate or residual signals are unblocked at 
the final stage of the processor. 
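
The frame-blocking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `block_frames` is hypothetical, and dropping a trailing partial frame is a design choice made here, not something the text prescribes.

```python
def block_frames(signal, N, N0):
    """Split `signal` into frames of N samples; successive frames share N0 samples."""
    if not 0 <= N0 < N:
        raise ValueError("overlap N0 must satisfy 0 <= N0 < N")
    hop = N - N0  # number of new samples entering each frame
    # Keep only complete frames; a trailing partial frame is dropped (assumption).
    return [signal[start:start + N] for start in range(0, len(signal) - N + 1, hop)]


frames = block_frames(list(range(10)), N=4, N0=2)
for frame in frames:
    print(frame)
```

Each frame would then be passed to the normal-equation solver, and the per-frame estimates or residuals stitched back together in the unblocking stage.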

If we set e = 0, we have a set of N equations with M unknowns. If N = M, then (8.2.4)
usually has a unique solution. For N > M, we have an overdetermined system of linear 
equations that typically has no solution. Conversely, if N < M, we have an underdetermined 
system that has an infinite number of solutions. However, even if M > N or N > M, the 
system (8.2.4) has a natural, unique, least-squares solution. We next focus our attention on 
overdetermined systems since they play a very important role in practical applications. The 
underdetermined least-squares problem is examined in Section 8.7.2.


FIGURE 8.1
The columns of the data matrix are the records of data collected at each
input (sensor), whereas each row contains the samples from all inputs at
the same instant.


FIGURE 8.2
Block processing implementation of a general linear LSE estimator.
[Block diagram: frame blocking (frame length N, overlap N0), compute and solve normal
equations, compute estimates or residuals, frame unblocking.]


8.2.1 Derivation of the Normal Equations 
We provide an algebraic and a geometric solution to the LSE estimation problem; a calculus- 
based derivation is given in Problem 8.1. 


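Algebraic derivation. The energy of the error can be written as

$$\begin{aligned} E = \mathbf{e}^H\mathbf{e} &= (\mathbf{y}^H - \mathbf{c}^H\mathbf{X}^H)(\mathbf{y} - \mathbf{X}\mathbf{c}) \\ &= \mathbf{y}^H\mathbf{y} - \mathbf{c}^H\mathbf{X}^H\mathbf{y} - \mathbf{y}^H\mathbf{X}\mathbf{c} + \mathbf{c}^H\mathbf{X}^H\mathbf{X}\mathbf{c} \\ &= E_y - \mathbf{c}^H\mathbf{d} - \mathbf{d}^H\mathbf{c} + \mathbf{c}^H\mathbf{R}\mathbf{c} \end{aligned} \tag{8.2.8}$$

where

$$E_y \triangleq \mathbf{y}^H\mathbf{y} = \sum_{n=0}^{N-1} |y(n)|^2 \tag{8.2.9}$$

$$\mathbf{R} \triangleq \mathbf{X}^H\mathbf{X} = \sum_{n=0}^{N-1} \mathbf{x}(n)\,\mathbf{x}^H(n) \tag{8.2.10}$$

$$\mathbf{d} \triangleq \mathbf{X}^H\mathbf{y} = \sum_{n=0}^{N-1} \mathbf{x}(n)\,y^*(n) \tag{8.2.11}$$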

Note that these quantities can be viewed as time-average estimates of the desired response 
power, correlation matrix of the input data vector, and the cross-correlation vector between 
the desired response and the data vector, when these quantities are divided by the number 
of data samples N. 

We emphasize that all formulas derived for the MMSE criterion hold for the LSE criterion
if we replace the expectation $E\{(\cdot)\}$ with the time-average operator $(1/N)\sum_{n=0}^{N-1}(\cdot)$.
This results from the fact that both criteria are quadratic cost functions. Therefore, working
as in Section 6.2.2, we conclude that if the time-average correlation matrix R is positive
definite, the LSE estimator $\mathbf{c}_{ls}$ is provided by the solution of the normal equations

$$\mathbf{R}\,\mathbf{c}_{ls} = \mathbf{d} \tag{8.2.12}$$

and the minimum sum of squared errors is given by

$$E_{ls} = E_y - \mathbf{d}^H\mathbf{R}^{-1}\mathbf{d} = E_y - \mathbf{d}^H\mathbf{c}_{ls} \tag{8.2.13}$$

Since R is Hermitian, we only need to compute the elements

$$r_{ij} = \tilde{\mathbf{x}}_i^H\,\tilde{\mathbf{x}}_j \tag{8.2.14}$$

in the upper triangular part, which requires M(M + 1)/2 dot products. The right-hand side
requires M dot products

$$d_i = \tilde{\mathbf{x}}_i^H\,\mathbf{y} \tag{8.2.15}$$

Note that each dot product involves N arithmetic operations, each consisting of one multi-
plication and one addition. Thus, to form the normal equations requires a total of

$$\tfrac{1}{2}M(M+1)N + MN = \tfrac{1}{2}M^2N + \tfrac{3}{2}MN \tag{8.2.16}$$

arithmetic operations. When R is nonsingular, which is the case when R is positive definite,
we can solve the normal equations using either the $\mathbf{L}\mathbf{D}\mathbf{L}^H$ or the Cholesky decomposition
(see Section 6.3). However, it should be stressed at this point that most of the computational 
work lies in forming the normal equations rather than their solution. The formulation of the 
overdetermined LS equations and the normal equations is illustrated graphically in Figure 
8.3. The solution of LS problems has been extensively studied in various application areas 
and in numerical analysis. The basic methods for the solution of the LS problem, which are 
discussed in this book, are shown in Figure 8.4. We just stress here that for overdetermined 
LS problems, well-behaved data, and sufficient numerical precision, all these methods 
provide comparable results. 
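
As a rough illustration of the algebraic derivation, the following Python sketch forms R = X^T X and d = X^T y for a small real-valued data block, solves the normal equations (8.2.12), and evaluates the LSE from (8.2.13). The data values and the helper names `normal_equations` and `solve` are illustrative assumptions, and plain Gauss-Jordan elimination with exact fractions stands in for the LDL^H or Cholesky solvers mentioned above.

```python
from fractions import Fraction


def normal_equations(X, y):
    """Form R = X^T X and d = X^T y for real-valued data (N rows, M columns)."""
    N, M = len(X), len(X[0])
    R = [[sum(X[n][i] * X[n][j] for n in range(N)) for j in range(M)] for i in range(M)]
    d = [sum(X[n][i] * y[n] for n in range(N)) for i in range(M)]
    return R, d


def solve(A, b):
    """Solve A c = b by Gauss-Jordan elimination in exact rational arithmetic."""
    M = len(A)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(M):
        # Pick any row with a nonzero pivot (always exists when A is nonsingular).
        pivot_row = next(r for r in range(col, M) if aug[r][col] != 0)
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(M):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[M] for row in aug]


# Hypothetical data: N = 4 snapshots, M = 2 inputs (a constant and a ramp).
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 5, 8]

R, d = normal_equations(X, y)
c_ls = solve(R, d)
E_y = sum(v * v for v in y)
E_ls = E_y - sum(di * ci for di, ci in zip(d, c_ls))  # (8.2.13) for real data
print("R =", R, "d =", d)
print("c_ls =", c_ls, "E_ls =", E_ls)
```

For this block the residuals are 1/5, -1/10, -2/5, and 3/10, whose squares sum to the same E_ls obtained from (8.2.13).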


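Geometric derivation. We may think of the desired response record $\mathbf{y}$ and the data
records $\tilde{\mathbf{x}}_k$, $1 \le k \le M$, as vectors in an N-dimensional vector space, with the dot product
and length defined by

$$\langle \tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j \rangle \triangleq \tilde{\mathbf{x}}_i^H\,\tilde{\mathbf{x}}_j = \sum_{n=0}^{N-1} x_i(n)\,x_j^*(n) \tag{8.2.17}$$

and

$$\|\tilde{\mathbf{x}}\|^2 \triangleq \langle \tilde{\mathbf{x}}, \tilde{\mathbf{x}} \rangle = \sum_{n=0}^{N-1} |x(n)|^2 = E_x \tag{8.2.18}$$

respectively. The estimate of the desired response record can be expressed as

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{c} = \sum_{k=1}^{M} c_k\,\tilde{\mathbf{x}}_k \tag{8.2.19}$$

that is, as a linear combination of the data records.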

The M vectors $\tilde{\mathbf{x}}_k$ form an M-dimensional subspace, called the estimation space, which
is the column space of data matrix X. Clearly, any estimate $\hat{\mathbf{y}}$ must lie in the estimation space.
The desired response record $\mathbf{y}$, in general, lies outside the estimation space. The estimation
space for M = 2 and N = 3 is illustrated in Figure 8.5. The error vector $\mathbf{e}$ points from the
tip of $\hat{\mathbf{y}}$ to the tip of $\mathbf{y}$. The squared length of $\mathbf{e}$ is minimum when $\mathbf{e}$ is perpendicular to the
estimation space, that is, $\mathbf{e} \perp \tilde{\mathbf{x}}_k$ for $1 \le k \le M$.
Therefore, we have the orthogonality principle

$$\langle \tilde{\mathbf{x}}_k, \mathbf{e} \rangle = \tilde{\mathbf{x}}_k^H\,\mathbf{e} = 0 \qquad 1 \le k \le M \tag{8.2.20}$$

or more compactly

$$\mathbf{X}^H\mathbf{e} = \mathbf{X}^H(\mathbf{y} - \mathbf{X}\mathbf{c}_{ls}) = \mathbf{0}$$

or

$$(\mathbf{X}^H\mathbf{X})\,\mathbf{c}_{ls} = \mathbf{X}^H\mathbf{y} \tag{8.2.21}$$

which we recognize as the LSE normal equations from (8.2.12).


FIGURE 8.3
The LS problem and computation of the normal equations.


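The LS solution splits the desired response $\mathbf{y}$ into two orthogonal components, namely,
$\hat{\mathbf{y}}_{ls}$ and $\mathbf{e}_{ls}$. Therefore,

$$\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}_{ls}\|^2 + \|\mathbf{e}_{ls}\|^2 \tag{8.2.22}$$

and, using (8.2.18) and (8.2.19), we have

$$E_{ls} = E_y - \mathbf{c}_{ls}^H\mathbf{X}^H\mathbf{X}\mathbf{c}_{ls} = E_y - \mathbf{c}_{ls}^H\mathbf{X}^H\mathbf{y} \tag{8.2.23}$$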


FIGURE 8.4
Classification of different computational algorithms for the solution of
the LS problem.
[Diagram: LS computations on the data {X, y} split into two branches. Power domain: use the
normal equations $(\mathbf{X}^H\mathbf{X})\mathbf{c}_{ls} = \mathbf{X}^H\mathbf{y}$. Amplitude domain: work directly with the data {X, y},
through QR decomposition (Householder reflection, Givens rotation, Gram-Schmidt
orthogonalization) or singular value decomposition.]


FIGURE 8.5
Vector space interpretation of LSE estimation
for N = 3 (dimension of data space) and
M = 2 (dimension of estimation subspace).


(8.2.24) 


which is in the range 0 < € < 1, with limits of 0 and 1, which correspond to the worst and 


best cases, respectively. 


Uniqueness. The solution of the LSE normal equations exists and is unique if the
time-average correlation matrix R is invertible. We shall prove the following:


THEOREM 8.1. The time-average correlation matrix $\mathbf{R} = \mathbf{X}^H\mathbf{X}$ is invertible if and only if the
columns $\tilde{\mathbf{x}}_k$ of X are linearly independent, or equivalently if and only if R is positive definite.

Proof. If the columns of X are linearly independent, then for every $\mathbf{z} \ne \mathbf{0}$ we have $\mathbf{X}\mathbf{z} \ne \mathbf{0}$.
This implies that for every $\mathbf{z} \ne \mathbf{0}$

$$\mathbf{z}^H(\mathbf{X}^H\mathbf{X})\mathbf{z} = (\mathbf{X}\mathbf{z})^H\mathbf{X}\mathbf{z} = \|\mathbf{X}\mathbf{z}\|^2 > 0 \tag{8.2.25}$$

that is, R is positive definite and hence nonsingular.
If the columns of X are linearly dependent, then there is a vector $\mathbf{z}_0 \ne \mathbf{0}$ such that $\mathbf{X}\mathbf{z}_0 = \mathbf{0}$.
Therefore, $\mathbf{X}^H\mathbf{X}\mathbf{z}_0 = \mathbf{0}$, which implies that $\mathbf{R} = \mathbf{X}^H\mathbf{X}$ is singular.


For a matrix to have linearly independent columns, the number of rows should be
equal to or larger than the number of columns; that is, we must have at least as many equations
as unknowns. To summarize, the overdetermined (N > M) LS problem has a unique solution
provided by the normal equations in (8.2.12) if the time-average correlation matrix R is 
positive definite, or equivalently if the data matrix X has linearly independent columns. 

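In this case, the LS solution can be expressed as

$$\mathbf{c}_{ls} = \mathbf{X}^+\mathbf{y} \tag{8.2.26}$$

where

$$\mathbf{X}^+ \triangleq (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H \tag{8.2.27}$$

is an M x N matrix known as the pseudo-inverse or the Moore-Penrose generalized inverse
of matrix X (Golub and Van Loan 1996; Strang 1980).
The LS estimate $\hat{\mathbf{y}}_{ls}$ of $\mathbf{y}$ can be expressed as

$$\hat{\mathbf{y}}_{ls} = \mathbf{P}\mathbf{y} \tag{8.2.28}$$

where

$$\mathbf{P} \triangleq \mathbf{X}(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H \tag{8.2.29}$$

is known as the projection matrix because it projects the data vector $\mathbf{y}$ onto the column
space of X to provide the LS estimate $\hat{\mathbf{y}}_{ls}$ of $\mathbf{y}$. Similarly, the LS error vector $\mathbf{e}_{ls}$ can be
expressed as

$$\mathbf{e}_{ls} = (\mathbf{I} - \mathbf{P})\mathbf{y} \tag{8.2.30}$$

where I is the N x N identity matrix. The projection matrix P is Hermitian and idempotent,
that is,

$$\mathbf{P} = \mathbf{P}^H \tag{8.2.31}$$

and

$$\mathbf{P}^2 = \mathbf{P}^H\mathbf{P} = \mathbf{P} \tag{8.2.32}$$

respectively.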

respectively. 

When the columns of X are linearly dependent, the LS problem has many solutions. 
Since all these solutions satisfy the normal equations and the orthogonal projection of y 
onto the column space of X is unique, all these solutions produce an error vector e of equal 
length, that is, the same LSE. This subject is discussed in Section 8.6.2 (minimum-norm 
solution). 
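
The properties (8.2.31) and (8.2.32) of the projection matrix are easy to check numerically. The sketch below does so for a hypothetical real-valued 3 x 2 data matrix with exact fractions; the matrix entries and the helper names `matmul`, `transpose`, and `inverse_2x2` are assumptions made for illustration only.

```python
from fractions import Fraction


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inverse_2x2(A):
    """Closed-form inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical data matrix with linearly independent columns (N = 3, M = 2).
X = [[1, 0], [1, 1], [1, 2]]
Xt = transpose(X)

# P = X (X^T X)^{-1} X^T, the real-valued form of (8.2.29).
P = matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)
identity = [[Fraction(int(i == j)) for j in range(3)] for i in range(3)]
I_minus_P = [[identity[i][j] - P[i][j] for j in range(3)] for i in range(3)]

print("P =", P)
print("symmetric: ", P == transpose(P))
print("idempotent:", matmul(P, P) == P)
print("trace(P) = ", sum(P[i][i] for i in range(3)))   # equals M, the rank of X
print("(I - P) X =", matmul(I_minus_P, X))             # data records are annihilated
```

The last line reflects the orthogonality principle: the complementary projector I - P removes every component lying in the column space of X.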


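EXAMPLE 8.2.1. Suppose that we wish to estimate the sequence $\mathbf{y} = [1\ 2\ 3\ 2]^T$ from the
observation vectors $\tilde{\mathbf{x}}_1 = [1\ 2\ 1\ 1]^T$ and $\tilde{\mathbf{x}}_2 = [2\ 1\ 2\ 3]^T$. Determine the optimum filter, the
error vector $\mathbf{e}_{ls}$, and the LSE $E_{ls}$.


Solution. We first compute the quantities

$$\mathbf{R} = \mathbf{X}^T\mathbf{X} = \begin{bmatrix} 1 & 2 & 1 & 1 \\ 2 & 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 7 & 9 \\ 9 & 18 \end{bmatrix} \qquad \mathbf{d} = \mathbf{X}^T\mathbf{y} = \begin{bmatrix} 1 & 2 & 1 & 1 \\ 2 & 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 10 \\ 16 \end{bmatrix}$$

and we then solve the normal equations $\mathbf{R}\mathbf{c}_{ls} = \mathbf{d}$ to obtain the LS estimator

$$\mathbf{c}_{ls} = \mathbf{R}^{-1}\mathbf{d} = \begin{bmatrix} \tfrac{4}{5} \\ \tfrac{22}{45} \end{bmatrix}$$

and the LSE

$$E_{ls} = E_y - \mathbf{d}^T\mathbf{c}_{ls} = 18 - [10\ \ 16] \begin{bmatrix} \tfrac{4}{5} \\ \tfrac{22}{45} \end{bmatrix} = \frac{98}{45}$$

The projection matrix is

$$\mathbf{P} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T = \begin{bmatrix} \tfrac{2}{9} & \tfrac{1}{9} & \tfrac{2}{9} & \tfrac{1}{3} \\ \tfrac{1}{9} & \tfrac{43}{45} & \tfrac{1}{9} & -\tfrac{2}{15} \\ \tfrac{2}{9} & \tfrac{1}{9} & \tfrac{2}{9} & \tfrac{1}{3} \\ \tfrac{1}{3} & -\tfrac{2}{15} & \tfrac{1}{3} & \tfrac{3}{5} \end{bmatrix}$$

which can be used to determine the error vector

$$\mathbf{e}_{ls} = \mathbf{y} - \mathbf{P}\mathbf{y} = \left[-\tfrac{7}{9}\ \ -\tfrac{4}{45}\ \ \tfrac{11}{9}\ \ -\tfrac{4}{15}\right]^T$$

whose squared norm is equal to $\|\mathbf{e}_{ls}\|^2 = \tfrac{98}{45} = E_{ls}$, as expected. We can also easily verify the
orthogonality principle $\mathbf{e}_{ls}^T\tilde{\mathbf{x}}_1 = \mathbf{e}_{ls}^T\tilde{\mathbf{x}}_2 = 0$.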
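
The arithmetic of Example 8.2.1 can be cross-checked with exact fractions. In the sketch below the fourth entry of the first data record is taken to be 1; this is an assumption, chosen because it reproduces the values 45 (determinant of R), 10 (first entry of d), and 98/45 (the LSE) that appear in the example. The helper name `dot` is illustrative.

```python
from fractions import Fraction

x1 = [1, 2, 1, 1]  # first data record; the last entry is an assumed value
x2 = [2, 1, 2, 3]  # second data record
y = [1, 2, 3, 2]   # desired response record


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Elements of R = X^T X and d = X^T y for real-valued data.
r11, r12, r22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
d1, d2 = dot(x1, y), dot(x2, y)

# Solve the 2 x 2 normal equations R c = d by Cramer's rule.
det = Fraction(r11 * r22 - r12 * r12)
c1 = (r22 * d1 - r12 * d2) / det
c2 = (r11 * d2 - r12 * d1) / det

# LS estimate, error vector, and LSE.
y_hat = [c1 * a + c2 * b for a, b in zip(x1, x2)]
e_ls = [yn - yh for yn, yh in zip(y, y_hat)]
E_ls = dot(e_ls, e_ls)

print("c_ls =", [c1, c2])
print("e_ls =", e_ls)
print("E_ls =", E_ls, " check via (8.2.13):", dot(y, y) - (d1 * c1 + d2 * c2))
print("orthogonality:", dot(e_ls, x1), dot(e_ls, x2))
```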

Weighted least-squares estimation. The previous results were derived by using an LS
criterion that treats every error e(n) equally. However, based on a priori information, we
may wish to place greater importance on different errors, using the weighted LS criterion

$$E_w = \sum_{n=0}^{N-1} w(n)\,|e(n)|^2 = \mathbf{e}^H\mathbf{W}\mathbf{e} \tag{8.2.33}$$

where

$$\mathbf{W} \triangleq \mathrm{diag}\{w(0), w(1), \ldots, w(N-1)\} \tag{8.2.34}$$

is a diagonal weighting matrix with positive elements. Usually, we choose small weights
where the errors are expected to be large, and vice versa. Minimization of $E_w$ with respect
to $\mathbf{c}$ yields the weighted LS (WLS) estimator

$$\mathbf{c}_{wls} = (\mathbf{X}^H\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^H\mathbf{W}\mathbf{y} \tag{8.2.35}$$

assuming that the inverse of the matrix $\mathbf{X}^H\mathbf{W}\mathbf{X}$ exists. We can easily see that when W = I,
then $\mathbf{c}_{wls} = \mathbf{c}_{ls}$. The criterion in (8.2.33) can be generalized by choosing W to be any
Hermitian, positive definite matrix (see Problem 8.2).
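
A minimal sketch of the WLS estimator (8.2.35) for real-valued data with two inputs and a diagonal W is given below. The data, the weights, and the function name `wls_two_params` are hypothetical; the point is only to show that unit weights reproduce the ordinary LS solution and that down-weighting a sample shifts the fit.

```python
from fractions import Fraction


def wls_two_params(X, y, w):
    """Solve (X^T W X) c = X^T W y for real data with two columns and diagonal W."""
    a = sum(wn * xn[0] * xn[0] for xn, wn in zip(X, w))
    b = sum(wn * xn[0] * xn[1] for xn, wn in zip(X, w))
    c = sum(wn * xn[1] * xn[1] for xn, wn in zip(X, w))
    g0 = sum(wn * xn[0] * yn for xn, yn, wn in zip(X, y, w))
    g1 = sum(wn * xn[1] * yn for xn, yn, wn in zip(X, y, w))
    det = Fraction(a * c - b * b)  # nonzero when X^T W X is invertible
    return [(c * g0 - b * g1) / det, (a * g1 - b * g0) / det]


# Hypothetical data block: N = 4 snapshots, M = 2 inputs.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 5, 8]

c_ls = wls_two_params(X, y, [1, 1, 1, 1])   # W = I gives the ordinary LS estimator
c_wls = wls_two_params(X, y, [4, 4, 4, 1])  # last sample trusted less than the others

# Weighted orthogonality check: X^T W (y - X c_wls) should vanish.
e_w = [yn - (c_wls[0] * xn[0] + c_wls[1] * xn[1]) for xn, yn in zip(X, y)]
check = [sum(wn * xn[k] * en for xn, en, wn in zip(X, e_w, [4, 4, 4, 1])) for k in range(2)]
print("c_ls =", c_ls, "c_wls =", c_wls, "X^T W e =", check)
```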


8.2.2 Statistical Properties of Least-Squares Estimators 


A useful approach for evaluating the quality of an LS estimator is to study its statistical 
properties. Toward this end, we assume that the obtained measurements y actually have 
been generated by 


$$\mathbf{y} = \mathbf{X}\mathbf{c}_o + \mathbf{e}_o \tag{8.2.36}$$

where $\mathbf{e}_o$ is the random measurement error vector. We may think of $\mathbf{c}_o$ as the "true" parameter
vector. Using (8.2.36), we see that (8.2.21) gives

$$\mathbf{c}_{ls} = \mathbf{c}_o + (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H\mathbf{e}_o \tag{8.2.37}$$

We make the following assumptions about the random measurement error vector $\mathbf{e}_o$:

1. The error vector $\mathbf{e}_o$ has zero mean

$$E\{\mathbf{e}_o\} = \mathbf{0} \tag{8.2.38}$$

2. The error vector $\mathbf{e}_o$ has uncorrelated components with constant variance $\sigma_{e_o}^2$; that is, the
correlation matrix is given by

$$\mathbf{R}_{e_o} = E\{\mathbf{e}_o\mathbf{e}_o^H\} = \sigma_{e_o}^2\,\mathbf{I} \tag{8.2.39}$$

3. There is no information about $\mathbf{e}_o$ contained in data matrix X; that is,

$$E\{\mathbf{e}_o \mid \mathbf{X}\} = E\{\mathbf{e}_o\} = \mathbf{0} \tag{8.2.40}$$

4. If X is a deterministic N x M matrix, then it has rank M. This means that X has
full column rank and that $\mathbf{X}^H\mathbf{X}$ is invertible. If X is a stochastic N x M matrix, then
$E\{(\mathbf{X}^H\mathbf{X})^{-1}\}$ exists.


In the following analysis, we consider two possibilities: X is deterministic or X is stochastic.
Under these conditions, the LS estimator $\mathbf{c}_{ls}$ has several desirable properties.


Deterministic data matrix 


In this case, we assume that the LS estimators are obtained from the deterministic data 
values; that is, the matrix X is treated as a matrix of constants. Then the properties of the LS 
estimators can be derived from the statistical properties of the random measurement error 
vector €y. 


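PROPERTY 8.2.1. The LS estimator $\mathbf{c}_{ls}$ is an unbiased estimator of $\mathbf{c}_o$, that is,

$$E\{\mathbf{c}_{ls}\} = \mathbf{c}_o \tag{8.2.41}$$

Proof. Taking the expectation of both sides of (8.2.37), we have

$$E\{\mathbf{c}_{ls}\} = E\{\mathbf{c}_o\} + (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H E\{\mathbf{e}_o\} = \mathbf{c}_o$$

because X is deterministic and $E\{\mathbf{e}_o\} = \mathbf{0}$.


PROPERTY 8.2.2. The covariance matrix of $\mathbf{c}_{ls}$ corresponding to the error $\mathbf{c}_{ls} - \mathbf{c}_o$ is

$$\boldsymbol{\Gamma}_{ls} \triangleq E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H\} = \sigma_{e_o}^2(\mathbf{X}^H\mathbf{X})^{-1} = \sigma_{e_o}^2\,\mathbf{R}^{-1} \tag{8.2.42}$$

Proof. Using (8.2.37), (8.2.39), and the definition (8.2.42), we easily obtain

$$\boldsymbol{\Gamma}_{ls} = (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H E\{\mathbf{e}_o\mathbf{e}_o^H\}\,\mathbf{X}(\mathbf{X}^H\mathbf{X})^{-1} = \sigma_{e_o}^2(\mathbf{X}^H\mathbf{X})^{-1}$$

Note that the diagonal elements of matrix $\sigma_{e_o}^2\mathbf{R}^{-1}$ are also equal to the variance of the
LS combiner vector $\mathbf{c}_{ls}$.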


PROPERTY 8.2.3. An unbiased estimate of the error variance Cod is given by 
E 
a2 Is 
= (8.2.43) 
“o  N-M 


where JN is the number of observations, M is the number of parameters, and E), is the LS error. 
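Proof. Using (8.2.30) and (8.2.36), we obtain
$$\mathbf{e}_{ls} = (\mathbf{I} - \mathbf{P})\mathbf{y} = (\mathbf{I} - \mathbf{P})\mathbf{e}_o$$
which results in
$$E_{ls} = \mathbf{e}_{ls}^H\mathbf{e}_{ls} = \mathbf{e}_o^H(\mathbf{I} - \mathbf{P})^H(\mathbf{I} - \mathbf{P})\mathbf{e}_o = \mathbf{e}_o^H(\mathbf{I} - \mathbf{P})\mathbf{e}_o$$
because of (8.2.32). Since $E_{ls}$ depends on $\mathbf{e}_o$, it is a random variable whose expected value is
$$E\{E_{ls}\} = E\{\mathbf{e}_o^H(\mathbf{I} - \mathbf{P})\mathbf{e}_o\} = E\{\operatorname{tr}[(\mathbf{I} - \mathbf{P})\mathbf{e}_o\mathbf{e}_o^H]\} = \operatorname{tr}[(\mathbf{I} - \mathbf{P})E\{\mathbf{e}_o\mathbf{e}_o^H\}] = \sigma_{e_o}^2 \operatorname{tr}(\mathbf{I} - \mathbf{P})$$
since $\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{A})$, where tr is the trace function. However,
$$\begin{aligned}
\operatorname{tr}(\mathbf{I} - \mathbf{P}) &= \operatorname{tr}[\mathbf{I} - \mathbf{X}(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H] \\
&= \operatorname{tr}(\mathbf{I}_{N\times N}) - \operatorname{tr}[(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H\mathbf{X}] \\
&= \operatorname{tr}(\mathbf{I}_{N\times N}) - \operatorname{tr}(\mathbf{I}_{M\times M}) = N - M
\end{aligned}$$
therefore
$$\sigma_{e_o}^2 = \frac{E\{E_{ls}\}}{N - M} \tag{8.2.44}$$
which proves that $\hat{\sigma}_{e_o}^2$ is an unbiased estimate of $\sigma_{e_o}^2$.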
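The two facts used in this proof can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original text; the helper name `noise_variance_estimate`, the dimensions, and the parameter values are illustrative assumptions. It verifies that $\operatorname{tr}(\mathbf{I} - \mathbf{P}) = N - M$ and that the average of $\hat{\sigma}_{e_o}^2$ over many noise realizations approaches the true variance.

```python
import numpy as np

def noise_variance_estimate(X, y):
    """Return (c_ls, sigma2_hat) with sigma2_hat = E_ls / (N - M), as in (8.2.43)."""
    N, M = X.shape
    c_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    e_ls = y - X @ c_ls
    E_ls = np.vdot(e_ls, e_ls).real
    return c_ls, E_ls / (N - M)

rng = np.random.default_rng(0)
N, M, sigma2 = 50, 3, 2.0                  # illustrative sizes and variance (assumptions)
X = rng.standard_normal((N, M))            # held fixed: deterministic data matrix
c_o = np.array([1.0, -0.5, 0.25])          # hypothetical true parameter vector

# tr(I - P) = N - M for the projection matrix P = X (X^H X)^{-1} X^H
P = X @ np.linalg.solve(X.conj().T @ X, X.conj().T)
trace_I_minus_P = np.trace(np.eye(N) - P).real

# Monte Carlo average of sigma2_hat over independent noise realizations
estimates = [
    noise_variance_estimate(X, X @ c_o + np.sqrt(sigma2) * rng.standard_normal(N))[1]
    for _ in range(2000)
]
mean_sigma2_hat = float(np.mean(estimates))
print(trace_I_minus_P, mean_sigma2_hat)
```

Dividing by $N$ instead of $N - M$ in the helper would bias the average downward by the factor $(N - M)/N$.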
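Similar to (8.2.41), the mean value of $\mathbf{c}_{wls}$ is
$$E\{\mathbf{c}_{wls}\} = E\{\mathbf{c}_o\} + (\mathbf{X}^H\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^H\mathbf{W}E\{\mathbf{e}_o\} = \mathbf{c}_o \tag{8.2.45}$$
that is, the WLS estimator is an unbiased estimate of $\mathbf{c}_o$. The covariance matrix of $\mathbf{c}_{wls}$ is
$$\boldsymbol{\Gamma}_{wls} = (\mathbf{X}^H\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^H\mathbf{W}\mathbf{R}_{e_o}\mathbf{W}\mathbf{X}(\mathbf{X}^H\mathbf{W}\mathbf{X})^{-1} \tag{8.2.46}$$
where $\mathbf{R}_{e_o}$ is the correlation matrix of $\mathbf{e}_o$. It is easy to see that when $\mathbf{R}_{e_o} = \sigma_{e_o}^2\mathbf{I}$ and $\mathbf{W} = \mathbf{I}$, we obtain (8.2.42).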


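PROPERTY 8.2.4. The trace of $\boldsymbol{\Gamma}_{wls}$ attains its minimum when $\mathbf{W} = \mathbf{R}_{e_o}^{-1}$. The resulting estimator
$$\mathbf{c}_{mv} = (\mathbf{X}^H\mathbf{R}_{e_o}^{-1}\mathbf{X})^{-1}\mathbf{X}^H\mathbf{R}_{e_o}^{-1}\mathbf{y} \tag{8.2.47}$$
is known as the minimum variance or Markov estimator and is the best linear unbiased estimator (BLUE).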


Proof. The proof is somewhat involved. Interested readers can see Goodwin and Payne (1977) 
and Scharf (1991). 


PROPERTY 8.2.5. If $\mathbf{R}_{e_o} = \sigma_{e_o}^2\mathbf{I}$, the LS estimator $\mathbf{c}_{ls}$ is also the best linear unbiased estimator.
Proof. It follows from (8.2.47) with the substitution $\mathbf{R}_{e_o} = \sigma_{e_o}^2\mathbf{I}$.
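Properties 8.2.4 and 8.2.5 can be illustrated by evaluating the covariance (8.2.46) for two choices of the weighting matrix. The sketch below (Python with NumPy) is an illustration under stated assumptions, not the book's code: the function name `wls_covariance` and the exponential error correlation $0.9^{|i-j|}$ are hypothetical choices made only to have a colored error process.

```python
import numpy as np

def wls_covariance(X, W, R_e):
    """Covariance (8.2.46) of the WLS estimator for weighting W and error correlation R_e."""
    A = np.linalg.solve(X.conj().T @ W @ X, X.conj().T @ W)   # (X^H W X)^{-1} X^H W
    return A @ R_e @ A.conj().T

rng = np.random.default_rng(1)
N, M = 30, 3
X = rng.standard_normal((N, M))

# Illustrative colored error correlation: R_e[i, j] = 0.9**|i - j| (an assumption)
idx = np.arange(N)
R_e = 0.9 ** np.abs(idx[:, None] - idx[None, :])

gamma_ls = wls_covariance(X, np.eye(N), R_e)              # W = I (ordinary LS)
gamma_mv = wls_covariance(X, np.linalg.inv(R_e), R_e)     # W = R_e^{-1} (Markov estimator)

trace_ls = np.trace(gamma_ls)
trace_mv = np.trace(gamma_mv)
print(trace_ls, trace_mv)
```

With $\mathbf{W} = \mathbf{R}_{e_o}^{-1}$ the expression (8.2.46) collapses to $(\mathbf{X}^H\mathbf{R}_{e_o}^{-1}\mathbf{X})^{-1}$, and when $\mathbf{R}_{e_o}$ is a scaled identity the two traces coincide, which is the content of Property 8.2.5.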


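PROPERTY 8.2.6. When the random observation vector $\mathbf{e}_o$ has a normal distribution with mean zero and correlation matrix $\mathbf{R}_{e_o} = \sigma_{e_o}^2\mathbf{I}$, that is, when its components are uncorrelated, the LS estimator $\mathbf{c}_{ls}$ is also the maximum likelihood estimator.

Proof. Since the components of vector $\mathbf{e}_o$ are uncorrelated and normally distributed with zero mean and variance $\sigma_{e_o}^2$, the likelihood function for real-valued $\mathbf{e}_o$ is given by
$$L(\mathbf{c}) = \prod_{n=0}^{N-1} \frac{1}{\sqrt{2\pi}\,\sigma_{e_o}} \exp\left[-\frac{|e_o(n)|^2}{2\sigma_{e_o}^2}\right] \tag{8.2.48}$$
and its logarithm by
$$\ln L(\mathbf{c}) = -\frac{1}{2\sigma_{e_o}^2}\mathbf{e}_o^H\mathbf{e}_o - \frac{N}{2}\ln(2\pi\sigma_{e_o}^2) = -\frac{1}{2\sigma_{e_o}^2}(\mathbf{y} - \mathbf{X}\mathbf{c})^H(\mathbf{y} - \mathbf{X}\mathbf{c}) + \text{const} \tag{8.2.49}$$

For complex-valued $\mathbf{e}_o$, the terms $\sqrt{2\pi}\,\sigma_{e_o}$ and $2\sigma_{e_o}^2$ in (8.2.48) are replaced by $\pi\sigma_{e_o}^2$ and $\sigma_{e_o}^2$, respectively. Since the logarithm is a monotonic function, maximization of $L(\mathbf{c})$ is equivalent to maximization of $\ln L(\mathbf{c})$, that is, to minimization of $(\mathbf{y} - \mathbf{X}\mathbf{c})^H(\mathbf{y} - \mathbf{X}\mathbf{c})$. It is easy to see, by comparison with (8.2.8), that the LS solution maximizes this likelihood function.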


Stochastic data matrix 


We now extend the statistical properties of $\mathbf{c}_{ls}$ from the preceding section to the situation in which the data values in $\mathbf{X}$ are obtained from a random source with a known probability distribution. This situation is best handled by first obtaining the desired results conditioned on $\mathbf{X}$, which is equivalent to the deterministic case. We then determine the unconditional results by (statistical) averaging over the conditional distributions using the following properties of the conditional averages.

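The conditional mean and the conditional covariance of a random vector $\mathbf{x}(\zeta)$, given another random vector $\mathbf{y}(\zeta)$, are defined by
$$\boldsymbol{\mu}_{x|y} \triangleq E\{\mathbf{x}(\zeta) \mid \mathbf{y}(\zeta)\}$$
and
$$\boldsymbol{\Gamma}_{x|y} \triangleq E\{[\mathbf{x}(\zeta) - \boldsymbol{\mu}_{x|y}][\mathbf{x}(\zeta) - \boldsymbol{\mu}_{x|y}]^H \mid \mathbf{y}(\zeta)\}$$
respectively. Since both quantities are random objects, it can be shown that
$$\boldsymbol{\mu}_x = E\{\mathbf{x}(\zeta)\} = E_y\{E\{\mathbf{x}(\zeta) \mid \mathbf{y}(\zeta)\}\}$$
which is known as the law of iterated expectations and that
$$\boldsymbol{\Gamma}_x = \boldsymbol{\Gamma}^{(y)}_{\mu_{x|y}} + E_y\{\boldsymbol{\Gamma}_{x|y}\}$$
which is called the decomposition of the covariance rule. This rule states that the covariance of a random vector $\mathbf{x}(\zeta)$ decomposes into the covariance of the conditional mean plus the mean of the conditional covariance. The covariance of the conditional mean, $\boldsymbol{\Gamma}^{(y)}_{\mu_{x|y}}$, is given by
$$\boldsymbol{\Gamma}^{(y)}_{\mu_{x|y}} \triangleq E_y\{[\boldsymbol{\mu}_{x|y} - \boldsymbol{\mu}_x][\boldsymbol{\mu}_{x|y} - \boldsymbol{\mu}_x]^H\}$$
where the notation $\boldsymbol{\Gamma}^{(y)}$ indicates the covariance over the distribution of $\mathbf{y}(\zeta)$. More details can be found in Greene (1993).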
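PROPERTY 8.2.7. The LS estimator $\mathbf{c}_{ls}$ is an unbiased estimator of $\mathbf{c}_o$.
Proof. Taking the conditional expectation with respect to $\mathbf{X}$ of both sides of (8.2.37), we obtain
$$E\{\mathbf{c}_{ls} \mid \mathbf{X}\} = E\{\mathbf{c}_o \mid \mathbf{X}\} + (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H E\{\mathbf{e}_o \mid \mathbf{X}\} \tag{8.2.50}$$
Now using the law of iterated expectations, we get
$$E\{\mathbf{c}_{ls}\} = E_X\{E\{\mathbf{c}_{ls} \mid \mathbf{X}\}\} = \mathbf{c}_o + E\{(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H E\{\mathbf{e}_o \mid \mathbf{X}\}\}$$
Since $E\{\mathbf{e}_o \mid \mathbf{X}\} = \mathbf{0}$, from assumption 3, we have $E\{\mathbf{c}_{ls}\} = \mathbf{c}_o$. Thus $\mathbf{c}_{ls}$ is also unconditionally unbiased.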


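PROPERTY 8.2.8. The covariance matrix of $\mathbf{c}_{ls}$ corresponding to the error $\mathbf{c}_{ls} - \mathbf{c}_o$ is
$$\boldsymbol{\Gamma}_{ls} \triangleq E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H\} = \sigma_{e_o}^2 E\{(\mathbf{X}^H\mathbf{X})^{-1}\} \tag{8.2.51}$$

Proof. From (8.2.42), the conditional covariance matrix of $\mathbf{c}_{ls}$, conditional on $\mathbf{X}$, is
$$E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H \mid \mathbf{X}\} = \sigma_{e_o}^2(\mathbf{X}^H\mathbf{X})^{-1} \tag{8.2.52}$$
For the unconditional covariance, we use the decomposition of covariance rule to obtain
$$\begin{aligned}
E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H\} &= E_X\{E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H \mid \mathbf{X}\}\} \\
&\quad + E_X\{(E\{\mathbf{c}_{ls} \mid \mathbf{X}\} - \mathbf{c}_o)(E\{\mathbf{c}_{ls} \mid \mathbf{X}\} - \mathbf{c}_o)^H\}
\end{aligned}$$
The second term on the right-hand side above is equal to zero since $E\{\mathbf{c}_{ls} \mid \mathbf{X}\} = \mathbf{c}_o$, and hence
$$\begin{aligned}
E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H\} &= E_X\{E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H \mid \mathbf{X}\}\} \\
&= E_X\{\sigma_{e_o}^2(\mathbf{X}^H\mathbf{X})^{-1}\} = \sigma_{e_o}^2 E\{(\mathbf{X}^H\mathbf{X})^{-1}\}
\end{aligned}$$
Thus the earlier result in (8.2.42) is modified by the expected value (or averaging) of $(\mathbf{X}^H\mathbf{X})^{-1}$.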
One important conclusion about the statistical properties of the LS estimator is that the 
results obtained for the deterministic data matrix X are also valid for the stochastic case. 


This conclusion also applies for the Markov estimators and maximum likelihood estimators 
(Greene 1993). 


8.3 LEAST-SQUARES FIR FILTERS 


We will now apply the theory of linear LS error estimation to the design of FIR filters. The treatment closely follows the notation and approach in Section 6.4. Recall that the filtering error is
$$e(n) = y(n) - \sum_{k=0}^{M-1} h(k)\,x(n-k) \triangleq y(n) - \mathbf{c}^H\mathbf{x}(n) \tag{8.3.1}$$
where $y(n)$ is the desired response,
$$\mathbf{x}(n) = [x(n)\ \ x(n-1)\ \ \cdots\ \ x(n-M+1)]^T \tag{8.3.2}$$
is the input data vector, and
$$\mathbf{c} = [c_0\ \ c_1\ \ \cdots\ \ c_{M-1}]^T \tag{8.3.3}$$
is the filter coefficient vector, related to the impulse response by $c_k = h^*(k)$. Suppose that we take measurements of the desired response $y(n)$ and the input signal $x(n)$ over the time interval $0 \le n \le N-1$. We hold the coefficients $\{c_k\}_0^{M-1}$ of the filter constant within this period and set any other required data samples equal to zero. For example, at time $n = 0$, that is, when we take the first measurement $x(0)$, the filter needs the samples $x(0), x(-1), \ldots, x(-M+1)$ to compute the output sample $\hat{y}(0)$. Since the samples $x(-1), \ldots, x(-M+1)$ are not available, to operate the filter, we should replace them with arbitrary values or start the filtering operation at time $n = M-1$. Indeed, for $M-1 \le n \le N-1$, all the input samples of $x(n)$ required by the filter to compute the output $\hat{y}(n)$ are available. If we want to compute the output while the last sample $x(N-1)$ is still in the filter memory, we must continue the filtering operation until $n = N+M-2$. Again, we need to assign arbitrary values to the unavailable samples $x(N), \ldots, x(N+M-2)$. Most often, we set the unavailable samples equal to zero, which can be thought of as windowing the sequences $x(n)$ and $y(n)$ with a rectangular window. To simplify the illustration, suppose that $N = 7$ and $M = 3$. Writing (8.3.1) for $n = 0, 1, \ldots, N+M-2$ and arranging in matrix form, we obtain


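$$\begin{bmatrix} e^*(0) \\ e^*(1) \\ e^*(2) \\ e^*(3) \\ e^*(4) \\ e^*(5) \\ e^*(6) \\ e^*(7) \\ e^*(8) \end{bmatrix}
=
\begin{bmatrix} y^*(0) \\ y^*(1) \\ y^*(2) \\ y^*(3) \\ y^*(4) \\ y^*(5) \\ y^*(6) \\ 0 \\ 0 \end{bmatrix}
-
\begin{bmatrix}
x^*(0) & 0 & 0 \\
x^*(1) & x^*(0) & 0 \\
x^*(2) & x^*(1) & x^*(0) \\
x^*(3) & x^*(2) & x^*(1) \\
x^*(4) & x^*(3) & x^*(2) \\
x^*(5) & x^*(4) & x^*(3) \\
x^*(6) & x^*(5) & x^*(4) \\
0 & x^*(6) & x^*(5) \\
0 & 0 & x^*(6)
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix} \tag{8.3.4}$$
where the first row corresponds to $n = 0$, the third row to $n = M-1$, the seventh row to $n = N-1$, and the last row to $n = N+M-2$; or, in general,
$$\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{c} \tag{8.3.5}$$
where the exact form of $\mathbf{e}$, $\mathbf{y}$, and $\mathbf{X}$ depends on the range $N_i \le n \le N_f$ of measurements to be used, which in turn determines the range of summation
$$E = \sum_{n=N_i}^{N_f} |e(n)|^2 = \mathbf{e}^H\mathbf{e} \tag{8.3.6}$$
in the LS criterion. The LS FIR filter is found by solving the LS normal equations
$$(\mathbf{X}^H\mathbf{X})\mathbf{c}_{ls} = \mathbf{X}^H\mathbf{y} \tag{8.3.7}$$
or
$$\hat{\mathbf{R}}\mathbf{c}_{ls} = \hat{\mathbf{d}} \tag{8.3.8}$$
with an LS error of
$$E_{ls} = E_y - \hat{\mathbf{d}}^H\mathbf{c}_{ls} \tag{8.3.9}$$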


where $E_y$ is the energy of the desired response signal. The elements of the time-average correlation matrix $\hat{\mathbf{R}}$ are given by
$$\hat{r}_{ij} = \tilde{\mathbf{x}}_i^H\tilde{\mathbf{x}}_j = \sum_{n=N_i}^{N_f} x(n+1-i)\,x^*(n+1-j) \qquad 1 \le i, j \le M \tag{8.3.10}$$
where $\tilde{\mathbf{x}}_i$ are the columns of data matrix $\mathbf{X}$. A simple manipulation of (8.3.10) leads to
$$\hat{r}_{i+1,j+1} = \hat{r}_{ij} + x(N_i - i)\,x^*(N_i - j) - x(N_f + 1 - i)\,x^*(N_f + 1 - j) \qquad 1 \le i, j < M \tag{8.3.11}$$
which relates the elements of matrix $\hat{\mathbf{R}}$ that are located on the same diagonal. This property holds because the columns of $\mathbf{X}$ are obtained by shifting the first column. The recursion in (8.3.11) suggests the following way of efficiently computing $\hat{\mathbf{R}}$:

1. Compute the first row of $\hat{\mathbf{R}}$ by using (8.3.10). This requires $M$ dot products and a total of about $M(N_f - N_i)$ operations.

2. Compute the remaining elements in the upper triangular part of $\hat{\mathbf{R}}$, using (8.3.11). The required number of operations is proportional to $M^2$.

3. Compute the lower triangular part of $\hat{\mathbf{R}}$, using the Hermitian symmetry relation $\hat{r}_{ji} = \hat{r}_{ij}^*$.

Notice that direct computation of the upper triangular part of $\hat{\mathbf{R}}$ using (8.3.10), that is, without the recursion, requires approximately $M^2N/2$ operations, which increases significantly for moderate or large values of $M$.
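The three steps above can be sketched as follows in Python with NumPy. This is an illustration only, not the book's `lsmatvec` routine; the function names `corr_matrix_recursive` and `corr_matrix_direct` are hypothetical, and samples outside $0 \le n \le N-1$ are assumed to be zero.

```python
import numpy as np

def corr_matrix_recursive(x, M, Ni, Nf):
    """Time-average correlation matrix: (8.3.10) for the first row, (8.3.11) elsewhere."""
    x = np.asarray(x, dtype=complex)
    N = len(x)

    def xs(n):
        # Sample x(n), taken as zero outside the measured range 0..N-1
        return x[n] if 0 <= n < N else 0.0

    R = np.zeros((M, M), dtype=complex)
    # Step 1: first row from the definition (8.3.10); j is the 1-based column index
    for j in range(1, M + 1):
        R[0, j - 1] = sum(xs(n) * np.conj(xs(n + 1 - j)) for n in range(Ni, Nf + 1))
    # Step 2: move down each diagonal of the upper triangle with (8.3.11)
    for i in range(1, M):
        for j in range(i, M):
            R[i, j] = (R[i - 1, j - 1]
                       + xs(Ni - i) * np.conj(xs(Ni - j))
                       - xs(Nf + 1 - i) * np.conj(xs(Nf + 1 - j)))
    # Step 3: lower triangle from Hermitian symmetry
    low = np.tril_indices(M, -1)
    R[low] = np.conj(R.T[low])
    return R

def corr_matrix_direct(x, M, Ni, Nf):
    """Reference computation R = X^H X with rows x^H(n), Ni <= n <= Nf."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    X = np.array([[np.conj(x[n - k]) if 0 <= n - k < N else 0.0
                   for k in range(M)] for n in range(Ni, Nf + 1)])
    return X.conj().T @ X

x = np.array([1.0, 2.0 - 1.0j, -0.5, 0.3j, 1.5, -2.0, 0.7])
print(np.allclose(corr_matrix_recursive(x, 3, 2, 6), corr_matrix_direct(x, 3, 2, 6)))
```

Only the first row costs a full pass over the data; every other element is obtained from its upper-left neighbor with two multiplications.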

There are four ways to select the summation range $N_i \le n \le N_f$ that are used in LS filtering and prediction:

No windowing. If we set $N_i = M-1$ and $N_f = N-1$, we only use the available data and there are no distortions caused by forcing the data at the borders to artificial values.

Prewindowing. This corresponds to $N_i = 0$ and $N_f = N-1$ and is equivalent to setting the samples $x(-1), \ldots, x(-M+1)$ equal to zero. As a result, the term $x(N_i - i)\,x^*(N_i - j)$ does not appear in (8.3.11). This method is widely used in LS adaptive filtering.

Postwindowing. This corresponds to $N_i = M-1$ and $N_f = N+M-2$ and is equivalent to setting the samples $x(N), \ldots, x(N+M-2)$ equal to zero. As a result, the term $x(N_f + 1 - i)\,x^*(N_f + 1 - j)$ does not appear in (8.3.11). This method is not used very often for practical applications without prewindowing.

Full windowing. In this method, we impose both prewindowing and postwindowing (full windowing) to the input data and postwindowing to the desired response. The range of summation is from $N_i = 0$ to $N_f = N+M-2$, and as a result of full windowing, Eq. (8.3.11) becomes $\hat{r}_{i+1,j+1} = \hat{r}_{ij}$. Therefore, the elements $\hat{r}_{ij}$ depend only on $i - j$, and matrix $\hat{\mathbf{R}}$ is Toeplitz. In this case, the normal equations (8.2.12) can be obtained from the Wiener-Hopf equations (6.4.11) by replacing the theoretical autocorrelations with their estimated values (see Section 5.2).
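To make the four summation ranges concrete, the sketch below builds the data matrix $\mathbf{X}$ for each option and checks which choice yields a Toeplitz $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$. It is written in Python with NumPy as an illustration; the function names and the method labels `"none"`, `"pre"`, `"post"`, and `"full"` are hypothetical and are not the argument values of the book's `lsmatvec` function.

```python
import numpy as np

def window_range(method, N, M):
    """Summation range (Ni, Nf) for N data samples and M filter coefficients."""
    return {
        "none": (M - 1, N - 1),
        "pre":  (0, N - 1),
        "post": (M - 1, N + M - 2),
        "full": (0, N + M - 2),
    }[method]

def ls_data_matrix(x, M, method):
    """Data matrix X whose rows are x^H(n), Ni <= n <= Nf; unavailable samples are zero."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    Ni, Nf = window_range(method, N, M)
    xp = np.concatenate([np.zeros(M - 1), x, np.zeros(M - 1)])   # zero-padded copy
    # xp[n + M - 1 - k] holds x(n - k)
    return np.array([[np.conj(xp[n + M - 1 - k]) for k in range(M)]
                     for n in range(Ni, Nf + 1)])

def is_toeplitz(R, tol=1e-10):
    """True if every element equals its upper-left neighbor."""
    return all(abs(R[i + 1, j + 1] - R[i, j]) < tol
               for i in range(R.shape[0] - 1) for j in range(R.shape[1] - 1))

rng = np.random.default_rng(2)
x = rng.standard_normal(7) + 1j * rng.standard_normal(7)   # N = 7, M = 3 as in (8.3.4)
for method in ("none", "pre", "post", "full"):
    X = ls_data_matrix(x, 3, method)
    R = X.conj().T @ X
    print(method, X.shape, is_toeplitz(R))
```

For the `"full"` option the matrix has $N + M - 1 = 9$ rows and reproduces the data matrix in (8.3.4); the other options simply drop rows from its top, its bottom, or both.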

Clearly, when $N \gg M$ the performance difference between the various methods becomes insignificant. The no-windowing and full-windowing methods are known in the signal processing literature as the covariance and autocorrelation methods, respectively (Makhoul 1975b). We avoid these terms because they can lead to misleading statistical interpretations. We notice that in the LS filtering problem, the data matrix $\mathbf{X}$ is Toeplitz and the normal equations matrix $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ is the product of two Toeplitz matrices. However, $\hat{\mathbf{R}}$ is Toeplitz only in the full-windowing case when $\mathbf{X}$ is banded Toeplitz. In all other cases $\hat{\mathbf{R}}$ is near to Toeplitz or close to Toeplitz in a sense made precise in Morf et al. (1977).

The matrix R and vector d, for the various windowing methods, are computed by using 
the MATLAB function [R,d]=lsmatvec(x,M,method, y), which is based on (8.3.10) and 
(8.3.11). Then the LS filter is computed by cls=R\d. Figure 8.6 shows an FIR LSE filter 
operating in block processing mode. 
[Block diagram. Inputs: x(n) and y(n). Stages: frame blocking; compute and solve normal equations (giving $\mathbf{c}_{ls}$); FIR filter; frame unblocking.]

FIGURE 8.6
Block processing implementation of an FIR LSE filter.


EXAMPLE 8.3.1. To illustrate the design of least-squares FIR filters, suppose that we have a set of measurements of $x(n)$ and $y(n)$ for $0 \le n \le N-1$ with $N = 100$ that have been generated by the difference equation
$$y(n) = 0.5x(n) + 0.5x(n-1) + v(n)$$
The input $x(n)$ and the additive noise $v(n)$ are uncorrelated processes from a normal (Gaussian) distribution with mean $E\{x(n)\} = E\{v(n)\} = 0$ and variance $\sigma_x^2 = \sigma_v^2 = 1$. Fitting the model
$$\hat{y}(n) = h(0)x(n) + h(1)x(n-1)$$
to the measurements with the no-windowing LS criterion, we obtain
$$\mathbf{c}_{ls} = \begin{bmatrix} 0.5361 \\ 0.5570 \end{bmatrix} \qquad \hat{\sigma}_{e_o}^2 = 1.0419 \qquad \hat{\sigma}_{e_o}^2\hat{\mathbf{R}}^{-1} = \begin{bmatrix} 0.0073 & -0.0005 \\ -0.0005 & 0.0071 \end{bmatrix}$$
using (8.3.7), (8.3.9), (8.2.44), and (8.2.42). If the mean of the additive noise is nonzero, for example, if $E\{v(n)\} = 1$, we get
$$\mathbf{c}_{ls} = \begin{bmatrix} 0.4889 \\ 0.5258 \end{bmatrix} \qquad \hat{\sigma}_{e_o}^2 = 1.8655 \qquad \hat{\sigma}_{e_o}^2\hat{\mathbf{R}}^{-1} = \begin{bmatrix} 0.0131 & -0.0009 \\ -0.0009 & 0.0127 \end{bmatrix}$$
which shows that the variance of the estimates, that is, the diagonal elements of $\hat{\sigma}_{e_o}^2\hat{\mathbf{R}}^{-1}$, increases significantly. Suppose now that the recording device introduces an outlier in the input data at $x(30) = 20$. The estimated LS model and its associated statistics are given by
$$\mathbf{c}_{ls} = \begin{bmatrix} 0.1796 \\ 0.1814 \end{bmatrix} \qquad \hat{\sigma}_{e_o}^2 = \text{[value illegible]} \qquad \hat{\sigma}_{e_o}^2\hat{\mathbf{R}}^{-1} = \begin{bmatrix} 0.0030 & 0.0000 \\ 0.0000 & 0.0030 \end{bmatrix}$$
Similarly, when an outlier is present in the output data, for example, at $y(30) = 20$, then the LS model and its statistics are
$$\mathbf{c}_{ls} = \begin{bmatrix} 0.6303 \\ 0.4653 \end{bmatrix} \qquad \hat{\sigma}_{e_o}^2 = 5.0979 \qquad \hat{\sigma}_{e_o}^2\hat{\mathbf{R}}^{-1} = \begin{bmatrix} 0.0357 & -0.0025 \\ -0.0025 & 0.0347 \end{bmatrix}$$
In general, LS estimates are very sensitive to colored additive noise and outliers (Ljung 1987). Note that all the LS solutions in this example were produced with one sample realization $x(n)$ and that the results will vary for any other realizations.
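An experiment of this kind can be reproduced along the following lines. The sketch is in Python with NumPy rather than the MATLAB routine used in the text, the helper names `fit_two_tap` and `simulate` are hypothetical, and the random seed is arbitrary, so the numbers it prints will not match the values quoted in the example.

```python
import numpy as np

def fit_two_tap(x, y):
    """No-windowing LS fit of y(n) ~ h(0) x(n) + h(1) x(n-1) for real-valued data."""
    X = np.column_stack([x[1:], x[:-1]])         # rows [x(n), x(n-1)], n = 1, ..., N-1
    yv = y[1:]
    R = X.T @ X
    c_ls = np.linalg.solve(R, X.T @ yv)          # normal equations (8.3.7)
    resid = yv - X @ c_ls
    sigma2_hat = resid @ resid / (len(yv) - X.shape[1])      # (8.2.43)
    return c_ls, sigma2_hat, sigma2_hat * np.linalg.inv(R)   # (8.2.42) with the estimate

def simulate(N, rng, noise_mean=0.0):
    """Generate x(n) and y(n) = 0.5 x(n) + 0.5 x(n-1) + v(n)."""
    x = rng.standard_normal(N)
    v = noise_mean + rng.standard_normal(N)
    y = 0.5 * x + v
    y[1:] += 0.5 * x[:-1]      # y(0) lacks x(-1), but n = 0 is excluded by no windowing
    return x, y

rng = np.random.default_rng(3)
x, y = simulate(100, rng)
c_clean, s2_clean, cov_clean = fit_two_tap(x, y)

y_out = y.copy()
y_out[30] = 20.0                                 # outlier in the output data
c_out, s2_out, cov_out = fit_two_tap(x, y_out)
print(c_clean, s2_clean)
print(c_out, s2_out)
```

A single output outlier of this size inflates the variance estimate, and with it every entry of the estimated covariance matrix, even though the underlying system is unchanged.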


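LS inverse filters. Given a causal filter with impulse response $g(n)$, its inverse filter $h(n)$ is specified by $g(n) * h(n) = \delta(n - n_0)$, $n_0 \ge 0$. We focus on causal inverse filters, which are often infinite impulse response (IIR), and we wish to approximate them by some FIR filter $c_{ls}(n) = h^*(n)$ that is optimum according to the LS criterion. In this case, the actual impulse response $g(n) * c_{ls}^*(n)$ of the combined system deviates from the desired response $\delta(n - n_0)$, resulting in an error $e(n)$. The convolution equation
$$e(n) = \delta(n - n_0) - \sum_{k=0}^{M} c_{ls}^*(k)\,g(n-k) \tag{8.3.12}$$
can be written in matrix form; for $M = 2$ and $N = 6$ it becomes
$$\begin{bmatrix} e^*(0) \\ e^*(1) \\ e^*(2) \\ e^*(3) \\ e^*(4) \\ e^*(5) \\ e^*(6) \\ e^*(7) \\ e^*(8) \end{bmatrix}
=
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
-
\begin{bmatrix}
g^*(0) & 0 & 0 \\
g^*(1) & g^*(0) & 0 \\
g^*(2) & g^*(1) & g^*(0) \\
g^*(3) & g^*(2) & g^*(1) \\
g^*(4) & g^*(3) & g^*(2) \\
g^*(5) & g^*(4) & g^*(3) \\
g^*(6) & g^*(5) & g^*(4) \\
0 & g^*(6) & g^*(5) \\
0 & 0 & g^*(6)
\end{bmatrix}
\begin{bmatrix} c_{ls}(0) \\ c_{ls}(1) \\ c_{ls}(2) \end{bmatrix}$$
assuming that $n_0 = 0$. In general,
$$\mathbf{e} = \boldsymbol{\delta}_i - \mathbf{G}\mathbf{c}_{ls}^{(i)} \tag{8.3.13}$$
where $\boldsymbol{\delta}_i$ is a vector whose $i$th element is 1 and whose remaining elements are all zero. The LS inverse filter and the corresponding error are given by
$$(\mathbf{G}^H\mathbf{G})\mathbf{c}_{ls}^{(i)} = \mathbf{G}^H\boldsymbol{\delta}_i \tag{8.3.14}$$
and
$$E_{ls}^{(i)} = 1 - \boldsymbol{\delta}_i^H\mathbf{G}\mathbf{c}_{ls}^{(i)} \qquad 0 \le i \le M+N \tag{8.3.15}$$
respectively.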
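Using the projection operators (8.2.29) and (8.2.30), we can express the LS error as
$$E_{ls}^{(i)} = \boldsymbol{\delta}_i^H(\mathbf{P} - \mathbf{I})^H(\mathbf{P} - \mathbf{I})\boldsymbol{\delta}_i \tag{8.3.16}$$
where
$$\mathbf{P} = \mathbf{G}(\mathbf{G}^H\mathbf{G})^{-1}\mathbf{G}^H \tag{8.3.17}$$
The total error for all possible delays $0 \le i \le N+M$ can be written as
$$E_{total} = \sum_{i=0}^{N+M} E_{ls}^{(i)} = \operatorname{tr}[\mathbf{D}^H(\mathbf{P} - \mathbf{I})^H(\mathbf{P} - \mathbf{I})\mathbf{D}] \tag{8.3.18}$$
where
$$\mathbf{D} = [\boldsymbol{\delta}_0\ \ \boldsymbol{\delta}_1\ \ \boldsymbol{\delta}_2\ \ \cdots\ \ \boldsymbol{\delta}_{N+M}] = \mathbf{I}$$
is the $(N+M+1) \times (N+M+1)$ identity matrix. Since $\mathbf{D} = \mathbf{I}$, $\mathbf{P} = \mathbf{P}^H$, and $\mathbf{P}^2 = \mathbf{P}$, we obtain
$$E_{total} = \operatorname{tr}[\mathbf{D}^H(\mathbf{P} - \mathbf{I})^H(\mathbf{P} - \mathbf{I})\mathbf{D}] = \operatorname{tr}(\mathbf{I} - \mathbf{P}) = \operatorname{tr}(\mathbf{I}) - \operatorname{tr}(\mathbf{P})$$
or
$$E_{total} = N \tag{8.3.19}$$
because $\operatorname{tr}(\mathbf{I}) = N+M+1$ and
$$\operatorname{tr}(\mathbf{P}) = \operatorname{tr}[\mathbf{G}(\mathbf{G}^H\mathbf{G})^{-1}\mathbf{G}^H] = \operatorname{tr}[\mathbf{G}^H\mathbf{G}(\mathbf{G}^H\mathbf{G})^{-1}] = M+1 \tag{8.3.20}$$

Hence, $E_{total}$ depends on the length $N+1$ of the filter $g(n)$ and is independent of the length $M+1$ of the inverse filter $c_{ls}(n)$. If the minimum $E_{ls}^{(i)}$, for a given $N$, occurs at delay $i = i_0$, then, because the minimum cannot exceed the average of the $N+M+1$ errors, we have
$$E_{ls}^{(i_0)} \le \frac{E_{total}}{N+M+1} = \frac{N}{N+M+1} \tag{8.3.21}$$
which shows that $E_{ls}^{(i_0)} \to 0$ as $M \to \infty$ (Claerbout and Robinson 1963).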
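The results (8.3.19) and (8.3.21) are easy to confirm numerically. The sketch below, in Python with NumPy, computes the errors for all delays at once from the diagonal of the projection matrix; the function name `ls_inverse_errors` and the test filter are illustrative assumptions.

```python
import numpy as np

def ls_inverse_errors(g, M):
    """LS errors E_ls^(i), 0 <= i <= N + M, of the length-(M + 1) FIR inverse of g."""
    g = np.asarray(g, dtype=complex)
    N = len(g) - 1
    rows = N + M + 1
    # Convolution matrix G: column k holds g*(n - k), as in the matrix form of (8.3.12)
    G = np.zeros((rows, M + 1), dtype=complex)
    for k in range(M + 1):
        G[k:k + N + 1, k] = np.conj(g)
    P = G @ np.linalg.solve(G.conj().T @ G, G.conj().T)    # projection matrix (8.3.17)
    # With delta_i a unit vector, (8.3.16) reduces to E^(i) = 1 - P[i, i]
    return 1.0 - np.real(np.diag(P))

g = [1.0, -2.5, 0.7, 0.3]          # arbitrary test filter with N = 3 (an assumption)
for M in (2, 5, 10):
    E = ls_inverse_errors(g, M)
    print(M, E.sum(), E.min(), E.argmin())
```

The sum of the errors stays at $N$ for every $M$, while the smallest error, and the delay at which it occurs, changes with the length of the inverse filter.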


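EXAMPLE 8.3.2. Suppose that $g(n) = \delta(n) - \alpha\delta(n-1)$, where $\alpha$ is a real constant. The exact inverse filter is
$$H(z) = \frac{1}{1 - \alpha z^{-1}} \quad\Rightarrow\quad h(n) = \alpha^n u(n)$$
and is minimum-phase only if $-1 < \alpha < 1$. The inverse LS filter for M = 1 and N > 2 is obtained by applying (8.3.14) with
$$\mathbf{G} = \begin{bmatrix} 1 & 0 \\ -\alpha & 1 \\ 0 & -\alpha \end{bmatrix} \qquad\text{and}\qquad \boldsymbol{\delta} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$$
The normal equations are
$$\begin{bmatrix} 1 + \alpha^2 & -\alpha \\ -\alpha & 1 + \alpha^2 \end{bmatrix}\begin{bmatrix} c_{ls}(0) \\ c_{ls}(1) \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \tag{8.3.22}$$
leading to the LS inverse filter
$$c_{ls}(0) = \frac{1 + \alpha^2}{1 + \alpha^2 + \alpha^4} \qquad c_{ls}(1) = \frac{\alpha}{1 + \alpha^2 + \alpha^4}$$
with LS error
$$E_{ls} = 1 - c_{ls}(0) = \frac{\alpha^4}{1 + \alpha^2 + \alpha^4}$$
The system function of the LS inverse filter is
$$H_{ls}(z) = c_{ls}(0) + c_{ls}(1)z^{-1} = \frac{1 + \alpha^2}{1 + \alpha^2 + \alpha^4}\left(1 + \frac{\alpha}{1 + \alpha^2}z^{-1}\right)$$
and has a zero at $z_1 = -\alpha/(1 + \alpha^2) = -1/(\alpha + \alpha^{-1})$. Since $|z_1| < 1$ for any value of $\alpha$, the LS inverse filter is minimum-phase even if $g(n)$ is not. This stems from the fact that the normal equations (8.3.22) specify a one-step forward linear predictor with a correlation matrix that is Toeplitz and positive definite for any value of $\alpha$ (see Section 7.4).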


8.4 LINEAR LEAST-SQUARES SIGNAL ESTIMATION 


We now discuss the application of the LS method to general signal estimation, FLP, BLP, and 
combined forward and backward linear prediction. The reader is advised to review Section 
6.5, which provides a detailed discussion of the same problems for the MMSE criterion. 
The presentation in this section closely follows the viewpoint and notation in Section 6.5. 


8.4.1 Signal Estimation and Linear Prediction 


Suppose that we wish to compute the linear LS signal estimator $\mathbf{c}_{ls}^{(i)}$ defined by
$$e^{(i)}(n) = \sum_{k=0}^{M} c_k^{(i)*}\,x(n-k) = \mathbf{c}^{(i)H}\bar{\mathbf{x}}(n) \qquad\text{with } c_i^{(i)} \triangleq 1 \tag{8.4.1}$$
from the data $x(n)$, $0 \le n \le N-1$. Using (8.4.1) and following the process that led to (8.3.4), we obtain
$$\mathbf{e}^{(i)} = \bar{\mathbf{X}}\mathbf{c}^{(i)} \tag{8.4.2}$$
where
$$\bar{\mathbf{X}} = \begin{bmatrix}
x^*(0) & 0 & \cdots & 0 \\
x^*(1) & x^*(0) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
x^*(M) & x^*(M-1) & \cdots & x^*(0) \\
\vdots & \vdots & & \vdots \\
x^*(N-1) & x^*(N-2) & \cdots & x^*(N-M-1) \\
0 & x^*(N-1) & \cdots & x^*(N-M) \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & x^*(N-1)
\end{bmatrix} \tag{8.4.3}$$
is the combined data and desired response matrix with all the unavailable samples set equal to zero (full windowing). Matrix $\bar{\mathbf{X}}$ can be partitioned columnwise as
$$\bar{\mathbf{X}} = [\mathbf{X}_1\ \ \mathbf{y}\ \ \mathbf{X}_2] \tag{8.4.4}$$
where $\mathbf{y}$, the desired response, is the $i$th column of $\bar{\mathbf{X}}$. Using (8.4.4), we can easily show that the LS signal estimator $\mathbf{c}_{ls}^{(i)}$ and the associated LS error $E_{ls}^{(i)}$ are determined by
$$(\bar{\mathbf{X}}^H\bar{\mathbf{X}})\mathbf{c}_{ls}^{(i)} = \begin{bmatrix} \mathbf{0} \\ E_{ls}^{(i)} \\ \mathbf{0} \end{bmatrix} \tag{8.4.5}$$
where $E_{ls}^{(i)}$ is the $i$th element of the right-hand side vector (see Problem 8.3). If we define the time-average correlation matrix
$$\bar{\mathbf{R}} \triangleq \bar{\mathbf{X}}^H\bar{\mathbf{X}} \tag{8.4.6}$$
and use the augmented normal equations in (8.4.5), we obtain a set of equations that have the same form as (6.5.12), the equations for the MMSE signal estimator. Therefore, after we have computed $\bar{\mathbf{R}}$, using the command Rbar=lsmatvec(x,M+1,method), we can use the steps in Table 6.3 to compute the LS forward linear predictor (FLP), the backward linear predictor (BLP), the symmetric smoother, or any other signal estimator with delay $i$. Again, we use the standard notation $E^{(0)} = E^f$ and $\mathbf{c}^{(0)} = \mathbf{a}$ for the FLP and $E^{(M)} = E^b$ and $\mathbf{c}^{(M)} = \mathbf{b}$ for the BLP.

All formulas given in Section 6.5 hold for LS signal estimators if the matrix $\bar{\mathbf{R}}(n)$ is replaced by the time-average matrix $\bar{\mathbf{R}}$ of (8.4.6). However, we stress that although the optimum MMSE signal estimator $\mathbf{c}_o^{(i)}(n)$ is a deterministic vector, the LS signal estimator $\mathbf{c}_{ls}^{(i)}$ is a random vector that is a function of the random measurements $x(n)$, $0 \le n \le N-1$. In the full-windowing case, matrix $\bar{\mathbf{R}}$ is Toeplitz; if it is also positive definite, then the FLP is minimum-phase. Although the use of full windowing leads to these nice properties, it also creates some "edge effects" and bias in the estimates because we try to estimate some signal values using values that are not part of the signal by forcing the samples leading and lagging the available data measurements to zero.


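EXAMPLE 8.4.1. Suppose that we are given the signal segment $x(n) = \alpha^n$, $0 \le n \le N$, where $\alpha$ is an arbitrary complex-valued constant. Determine the first-order one-step forward linear predictor, using the full-windowing and no-windowing methods.

Solution. We start by forming the combined desired response and data matrix
$$\bar{\mathbf{X}}^H = \begin{bmatrix} x(0) & x(1) & \cdots & x(N) & 0 \\ 0 & x(0) & \cdots & x(N-1) & x(N) \end{bmatrix}$$
For the full-windowing method, the matrix
$$\bar{\mathbf{R}} = \bar{\mathbf{X}}^H\bar{\mathbf{X}} = \begin{bmatrix} \hat{r}_x(0) & \hat{r}_x(1) \\ \hat{r}_x^*(1) & \hat{r}_x(0) \end{bmatrix}$$
is Toeplitz with elements
$$\hat{r}_x(0) = \sum_{n=0}^{N} |x(n)|^2 = \sum_{n=0}^{N} |\alpha|^{2n} = \frac{1 - |\alpha|^{2(N+1)}}{1 - |\alpha|^2}$$
and
$$\hat{r}_x(1) = \sum_{n=1}^{N} x(n)\,x^*(n-1) = \sum_{n=1}^{N} \alpha^n(\alpha^*)^{n-1} = \alpha\,\frac{1 - |\alpha|^{2N}}{1 - |\alpha|^2}$$
Therefore, we have
$$\begin{bmatrix} \hat{r}_x(0) & \hat{r}_x(1) \\ \hat{r}_x^*(1) & \hat{r}_x(0) \end{bmatrix}\begin{bmatrix} 1 \\ a_1^{(1)} \end{bmatrix} = \begin{bmatrix} E_1^f \\ 0 \end{bmatrix}$$
whose solution gives
$$a_1^{(1)} = -\frac{\hat{r}_x^*(1)}{\hat{r}_x(0)} = -\alpha^*\,\frac{1 - |\alpha|^{2N}}{1 - |\alpha|^{2(N+1)}}$$
and
$$E_1^f = \hat{r}_x(0) + \hat{r}_x(1)\,a_1^{(1)} = \frac{1 - |\alpha|^{2(2N+1)}}{1 - |\alpha|^{2(N+1)}}$$
Since for every sequence $|\hat{r}_x(l)| \le \hat{r}_x(0)$, we have $|a_1^{(1)}| < 1$; that is, the obtained prediction error filter is minimum-phase. Furthermore, if $|\alpha| < 1$, then $\lim_{N\to\infty} a_1^{(1)} = -\alpha^*$ and $\lim_{N\to\infty} E_1^f = 1 = |x(0)|^2$. In the no-windowing case, the matrix
$$\bar{\mathbf{R}} = \bar{\mathbf{X}}^H\bar{\mathbf{X}} = \begin{bmatrix} \hat{r}_{11} & \hat{r}_{12} \\ \hat{r}_{12}^* & \hat{r}_{22} \end{bmatrix}$$
is Hermitian but not Toeplitz with elements
$$\hat{r}_{11} = \sum_{n=1}^{N} |x(n)|^2 = |\alpha|^2\,\frac{1 - |\alpha|^{2N}}{1 - |\alpha|^2} \qquad \hat{r}_{22} = \sum_{n=0}^{N-1} |x(n)|^2 = \frac{1 - |\alpha|^{2N}}{1 - |\alpha|^2}$$
$$\hat{r}_{12} = \sum_{n=1}^{N} x(n)\,x^*(n-1) = \alpha\,\frac{1 - |\alpha|^{2N}}{1 - |\alpha|^2}$$
Solving the linear system
$$\begin{bmatrix} \hat{r}_{11} & \hat{r}_{12} \\ \hat{r}_{12}^* & \hat{r}_{22} \end{bmatrix}\begin{bmatrix} 1 \\ a_1^{(1)} \end{bmatrix} = \begin{bmatrix} E_1^f \\ 0 \end{bmatrix}$$
we obtain
$$a_1^{(1)} = -\frac{\hat{r}_{12}^*}{\hat{r}_{22}} = -\alpha^*$$
and
$$E_1^f = \hat{r}_{11} + \hat{r}_{12}\,a_1^{(1)} = 0$$
We see that the no-windowing method provides a perfect linear predictor because there is no distortion due to windowing. However, the obtained prediction error filter is minimum-phase only when $|\alpha| < 1$.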
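The closed-form results of this example can be checked numerically. The following sketch, in Python with NumPy, is an illustration only: the function name `first_order_flp` and the values of $\alpha$ and $N$ are assumptions. It builds the matrix of rows $[x^*(n)\ \ x^*(n-1)]$ for both windowing choices and solves the $2 \times 2$ normal equations directly.

```python
import numpy as np

def first_order_flp(x, method):
    """First-order LS forward predictor a and error E^f for 'full' or 'none' windowing."""
    x = np.asarray(x, dtype=complex)
    xp = np.concatenate([[0.0], x, [0.0]])            # x(-1) = x(N + 1) = 0
    # Rows are [x*(n), x*(n - 1)] for n = 0, ..., N + 1
    Xbar = np.conj(np.column_stack([xp[1:], xp[:-1]]))
    if method == "none":
        Xbar = Xbar[1:-1]                             # keep only n = 1, ..., N
    R = Xbar.conj().T @ Xbar
    a = -np.conj(R[0, 1]) / R[1, 1]                   # second row of the normal equations
    Ef = (R[0, 0] + R[0, 1] * a).real                 # first row of the normal equations
    return a, Ef

alpha, N = 0.8 * np.exp(0.3j), 20                     # illustrative values (assumptions)
x = alpha ** np.arange(N + 1)                         # x(n) = alpha^n, 0 <= n <= N

a_full, E_full = first_order_flp(x, "full")
a_none, E_none = first_order_flp(x, "none")

q = abs(alpha) ** 2
a_formula = -np.conj(alpha) * (1 - q ** N) / (1 - q ** (N + 1))
E_formula = (1 - q ** (2 * N + 1)) / (1 - q ** (N + 1))
print(a_full, a_formula, E_full, E_formula)
print(a_none, E_none)
```

Because the filtering convention is $e(n) = x(n) + a^*x(n-1)$, perfect prediction of $\alpha^n$ requires $a = -\alpha^*$, which is what the no-windowing solution returns.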


EXAMPLE 8.4.2. To illustrate the statistical properties of the least-squares FLP, we generate $K = 500$ realizations of the MA(1) process $x(n) = w(n) + \tfrac{1}{2}w(n-1)$, where $w(n) \sim \mathrm{WN}(0, 1)$ (see Example 6.5.2). Each realization $x(\zeta_i, n)$ has duration $N = 100$ samples. We use these data to design an $M = 2$ order FLP, using the no-windowing LS method. The estimated mean and variance of the obtained $K$ FLP vectors are
$$\operatorname{Mean}\{\mathbf{a}(\zeta_i)\} = \begin{bmatrix} -0.4695 \\ 0.1889 \end{bmatrix} \qquad\text{and}\qquad \operatorname{var}\{\mathbf{a}(\zeta_i)\} = \begin{bmatrix} 0.0092 \\ \text{[illegible]} \end{bmatrix}$$
whereas the average of the variances $\hat{\sigma}_e^2$ is 0.9848. We notice that both means are close to the theoretical values obtained in Example 6.5.2. The covariance matrix of a given LS estimate $\mathbf{a}_{ls}$ was found to be
$$\hat{\sigma}_e^2\hat{\mathbf{R}}^{-1} = \begin{bmatrix} 0.0099 & -0.0043 \\ -0.0043 & 0.0099 \end{bmatrix}$$
whose diagonal elements are close to the components of $\operatorname{var}\{\mathbf{a}\}$, as expected. The bias in the estimate $\mathbf{a}_{ls}$ results from the fact that the residuals in the LS equations are correlated with each other (see Problem 8.4).


8.4.2 Combined Forward and Backward Linear Prediction (FBLP) 


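For stationary stochastic processes, the optimum MMSE forward and backward linear predictors have even conjugate symmetry, that is,
$$\mathbf{a}_o = \mathbf{J}\mathbf{b}_o^* \tag{8.4.7}$$
because both directions of time have the same second-order statistics. Formally, this property stems from the Toeplitz structure of the autocorrelation matrix (see Section 6.5). However, we could possibly improve performance by minimizing the total forward and backward squared error
$$E^{fb} = \sum_{n=N_i}^{N_f} \{|e^f(n)|^2 + |e^b(n)|^2\} \tag{8.4.8}$$
under the constraint
$$\mathbf{a}^{fb} \triangleq \mathbf{a} = \mathbf{J}\mathbf{b}^* \tag{8.4.9}$$
The FLP and BLP overdetermined sets of equations are
$$\mathbf{e}^f = \bar{\mathbf{X}}\begin{bmatrix} 1 \\ \mathbf{a} \end{bmatrix} \qquad\text{and}\qquad \mathbf{e}^b = \bar{\mathbf{X}}\begin{bmatrix} \mathbf{b} \\ 1 \end{bmatrix} \tag{8.4.10}$$
or
$$\mathbf{e}^f = \bar{\mathbf{X}}\begin{bmatrix} 1 \\ \mathbf{a}^{fb} \end{bmatrix} \qquad\text{and}\qquad \mathbf{e}^{b*} = \bar{\mathbf{X}}^*\begin{bmatrix} \mathbf{b}^* \\ 1 \end{bmatrix} = \bar{\mathbf{X}}^*\mathbf{J}\begin{bmatrix} 1 \\ \mathbf{a}^{fb} \end{bmatrix} \tag{8.4.11}$$
where we have used (8.4.9) and the property $\mathbf{J}\mathbf{J} = \mathbf{I}$ of the exchange matrix. If we combine the above two equations as
$$\begin{bmatrix} \mathbf{e}^f \\ \mathbf{e}^{b*} \end{bmatrix} = \begin{bmatrix} \bar{\mathbf{X}} \\ \bar{\mathbf{X}}^*\mathbf{J} \end{bmatrix}\begin{bmatrix} 1 \\ \mathbf{a}^{fb} \end{bmatrix} \tag{8.4.12}$$
then the forward-backward linear predictor that minimizes $E^{fb}$ is given by (see Problem 8.5)
$$\begin{bmatrix} \bar{\mathbf{X}} \\ \bar{\mathbf{X}}^*\mathbf{J} \end{bmatrix}^H\begin{bmatrix} \bar{\mathbf{X}} \\ \bar{\mathbf{X}}^*\mathbf{J} \end{bmatrix}\begin{bmatrix} 1 \\ \mathbf{a}_{ls}^{fb} \end{bmatrix} = \begin{bmatrix} E_{ls}^{fb} \\ \mathbf{0} \end{bmatrix}$$
or
$$(\bar{\mathbf{X}}^H\bar{\mathbf{X}} + \mathbf{J}\bar{\mathbf{X}}^T\bar{\mathbf{X}}^*\mathbf{J})\begin{bmatrix} 1 \\ \mathbf{a}_{ls}^{fb} \end{bmatrix} = \begin{bmatrix} E_{ls}^{fb} \\ \mathbf{0} \end{bmatrix} \tag{8.4.13}$$
which can be solved by using the steps described in Table 6.3. The time-average forward-backward correlation matrix
$$\hat{\mathbf{R}}_{fb} \triangleq \bar{\mathbf{X}}^H\bar{\mathbf{X}} + \mathbf{J}\bar{\mathbf{X}}^T\bar{\mathbf{X}}^*\mathbf{J} \tag{8.4.14}$$
with elements
$$\hat{r}_{ij}^{fb} = \hat{r}_{ij} + \hat{r}_{M-i,M-j}^* \qquad 0 \le i, j \le M \tag{8.4.15}$$
is persymmetric; that is, $\mathbf{J}\hat{\mathbf{R}}_{fb}\mathbf{J} = \hat{\mathbf{R}}_{fb}^*$, and its elements are conjugate symmetric about both main diagonals. In MATLAB we compute $\hat{\mathbf{R}}_{fb}$ by these commands: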


Rbar = lsmatvec(x,M+1,method);            % time-average correlation matrix, (8.4.6)
Rfb = Rbar + flipud(fliplr(conj(Rbar)));  % add J*conj(Rbar)*J, see (8.4.14)


The FBLP method is used with no windowing and was originally introduced independently by Ulrych and Clayton (1976) and Nuttall (1976) as a spectral estimation technique under the name modified covariance method (see Section 9.2). If we use full windowing, then $\mathbf{a}^{fb} = (\mathbf{a} + \mathbf{J}\mathbf{b}^*)/2$ (see Problem 8.6).
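For readers working outside MATLAB, the same construction can be sketched in Python with NumPy. The function name `fblp` is hypothetical, and the data matrix is formed directly with no windowing instead of calling the book's `lsmatvec` routine; the sketch solves (8.4.13) by partitioning the forward-backward correlation matrix.

```python
import numpy as np

def fblp(x, M):
    """No-windowing forward-backward LS predictor of order M, from (8.4.13)-(8.4.14)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    # Rows of Xbar are [x*(n), x*(n-1), ..., x*(n-M)] for n = M, ..., N-1
    Xbar = np.conj(np.array([x[n - M:n + 1][::-1] for n in range(M, N)]))
    Rbar = Xbar.conj().T @ Xbar
    Rfb = Rbar + np.flipud(np.fliplr(np.conj(Rbar)))      # Rbar + J conj(Rbar) J
    # Partitioned normal equations Rfb [1; a] = [E; 0]
    a = np.linalg.solve(Rfb[1:, 1:], -Rfb[1:, 0])
    E = (Rfb[0, 0] + Rfb[0, 1:] @ a).real
    return a, E, Rfb

rng = np.random.default_rng(6)
n = np.arange(64)
# Illustrative test signal (an assumption): complex sinusoid in weak white noise
x = np.exp(1j * 0.4 * np.pi * n) + 0.1 * (rng.standard_normal(64) + 1j * rng.standard_normal(64))
a, E, Rfb = fblp(x, 4)
J = np.fliplr(np.eye(5))                                  # exchange matrix
print(np.allclose(J @ Rfb @ J, np.conj(Rfb)), E)
```

The persymmetry check holds for any data record, because it follows from the construction (8.4.14) and not from any property of the signal.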


8.4.3 Narrowband Interference Cancelation 


Several practical applications require the removal of narrowband interference (NBI) from a wideband desired signal corrupted by additive white noise. For example, ground and foliage-penetrating radars operate from 0.01 to 1 GHz and use either an impulse or a chirp waveform. To achieve high resolution, these waveforms are extremely wideband, occupying at least 100 MHz within the range of 0.01 to 1 GHz. However, these frequency ranges are extensively used by TV and FM stations, cellular phones, and other relatively narrowband (less than 1 MHz) radio-frequency (RF) sources. Clearly, these sources spoil the radar returns with narrowband RF interference (Miller et al. 1997). Since the additive noise is often due to the sensor circuitry, it will be referred to as sensor thermal noise. Next we provide a practical solution to this problem, using an LS linear predictor. Suppose that the corrupted signal $x(n)$ is given by
$$x(n) = s(n) + y(n) + v(n) \tag{8.4.16}$$
where
$$\begin{aligned} s(n) &= \text{signal of interest} \\ y(n) &= \text{narrowband interference} \\ v(n) &= \text{thermal (white) noise} \end{aligned} \tag{8.4.17}$$
are the individual components, assumed to be stationary stochastic processes. We wish to design an NBI canceler that estimates and rejects the interference signal $y(n)$ from the signal $x(n)$, while preserving the signal of interest $s(n)$. Since signals $y(n)$ and $x(n)$ are correlated, we can form an estimate of the NBI using the optimum linear estimator
$$\hat{y}(n) = \mathbf{c}_o^H\mathbf{x}(n - D) \tag{8.4.18}$$
where
$$\mathbf{R}\mathbf{c}_o = \mathbf{d} \tag{8.4.19}$$
$$\mathbf{R} = E\{\mathbf{x}(n - D)\mathbf{x}^H(n - D)\} \tag{8.4.20}$$
$$\mathbf{d} = E\{\mathbf{x}(n - D)\,y^*(n)\} \tag{8.4.21}$$
and $D$ is an integer delay whose use will be justified shortly. Note that if $D = 1$, then (8.4.18) is the LS forward linear predictor. If $\hat{y}(n) = y(n)$, the output of the canceler is $x(n) - \hat{y}(n) = s(n) + v(n)$; that is, the NBI is completely excised, and the desired signal is corrupted by white noise only and is said to be thermal noise-limited.

Since, in practice, the required second-order moments are not available, we need to use an LS estimator instead. However, the quantity $\mathbf{X}^H\mathbf{y}$ in (8.2.21) requires the NBI signal $y(n)$, which is also not available. To overcome this obstacle, consider the optimum MMSE $D$-step forward linear predictor
$$e^f(n) = x(n) + \mathbf{a}^H\mathbf{x}(n - D) \tag{8.4.22}$$
$$\mathbf{R}\mathbf{a}_o = -\mathbf{r}^f \tag{8.4.23}$$
where $\mathbf{R}$ is given by (8.4.20) and
$$\mathbf{r}^f = E\{\mathbf{x}(n - D)\,x^*(n)\} \tag{8.4.24}$$

In many NBI cancelation applications, the components of the observed signal have the following properties:

1. The desired signal $s(n)$, the NBI $y(n)$, and the thermal noise $v(n)$ are mutually uncorrelated.

2. The thermal noise $v(n)$ is white; that is, $r_v(l) = \sigma_v^2\delta(l)$.

3. The desired signal $s(n)$ is wideband and therefore has a short correlation length; that is, $r_s(l) = 0$ for $|l| \ge D$.

4. The NBI has a long correlation length; that is, its autocorrelation takes significant values over the range $0 \le |l| \le M$ for $M > D$.

In practice, the second and third properties mean that the desired signal and the thermal noise are approximately uncorrelated after a certain small lag. These are precisely the properties exploited by the canceler to separate the NBI from the desired signal and the background noise.
As a result of the first assumption, we have
$$E\{x(n - k)\,y^*(n)\} = E\{y(n - k)\,y^*(n)\} = r_y(k) \qquad\text{for all } k \tag{8.4.25}$$
and
$$r_x(l) = r_s(l) + r_y(l) + r_v(l) \tag{8.4.26}$$
Making use of the second and third assumptions, we have
$$r_x(l) = r_y(l) \qquad\text{for } l \ne 0, 1, \ldots, D-1 \tag{8.4.27}$$

The exclusion of the lags $l = 0, 1, \ldots, D-1$ in $\mathbf{r}^f$ and $\mathbf{d}$ is critical, and we have arranged for that by forcing the filter and the predictor to form their estimates using the delayed data vector $\mathbf{x}(n - D)$. From (8.4.21), (8.4.24), and (8.4.27), we conclude that $\mathbf{d} = \mathbf{r}^f$ and therefore, comparing (8.4.19) with (8.4.23), $\mathbf{c}_o = -\mathbf{a}_o$. Thus, the optimum NBI estimator $\mathbf{c}_o$ is equal to the negative of the $D$-step linear predictor $\mathbf{a}_o$, which can be determined exclusively from the input signal $x(n)$. The cleaned signal is
$$x(n) - \hat{y}(n) = x(n) + \mathbf{a}_o^H\mathbf{x}(n - D) = e^f(n) \tag{8.4.28}$$
which is identical to the $D$-step forward prediction error. This leads to the linear prediction NBI canceler shown in Figure 8.7.

[Block diagram. Labels: corrupted signal x(n); forward linear predictor; cleaned signal.]

FIGURE 8.7
Block diagram of linear prediction NBI canceler.


To illustrate the performance of the linear prediction NBI canceler, we consider an impulse radar operating in a location with commercial radio and TV stations. The desired signal is a short-duration impulse corrupted by additive thermal noise and NBI (see Figure 8.8). The spectrum of the NBI is shown in Figure 8.9. We use a block of data ($N = 4096$) to design an FBLP with $D = 1$ and $M = 100$ coefficients, using the LS criterion with no windowing. Then we compute the cleaned signal, using (8.4.28). The cleaned signal, its spectrum, and the magnitude response of the NBI canceler are shown in Figures 8.8 and 8.9. We see that the canceler acts as a notch filter that optimally puts notches at the peaks of the NBI. A detailed description of the design of optimum least-squares NBI cancelers is given in Problem 8.27.
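The canceler of (8.4.28) can be sketched on synthetic data as follows, in Python with NumPy. This is a simplified illustration and not the experiment reported in the text: it uses a forward-only LS predictor instead of the FBLP, real-valued signals, two sinusoids as a stand-in for the NBI, and hypothetical names and parameter values throughout.

```python
import numpy as np

def nbi_canceler(x, M, D=1):
    """Cleaned signal e(n) = x(n) + a^T x(n - D) from a no-windowing LS D-step predictor."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n0 = D + M - 1                                   # first n with all delayed samples available
    # Row n holds the delayed data vector [x(n-D), x(n-D-1), ..., x(n-D-M+1)]
    Xd = np.array([x[n - D - M + 1:n - D + 1][::-1] for n in range(n0, N)])
    a, *_ = np.linalg.lstsq(Xd, -x[n0:], rcond=None) # minimizes sum |x(n) + a^T x(n-D)|^2
    return x[n0:] + Xd @ a, a

rng = np.random.default_rng(7)
N = 2048
n = np.arange(N)
s = np.zeros(N)
s[1000] = 10.0                                       # short impulse (signal of interest)
y = 3.0 * np.cos(0.30 * np.pi * n + 0.5) + 2.0 * np.cos(0.62 * np.pi * n)   # stand-in NBI
v = 0.5 * rng.standard_normal(N)                     # white thermal noise
x = s + y + v

cleaned, a = nbi_canceler(x, M=16, D=1)
n0 = 16                                              # D + M - 1, offset of cleaned[0]
print(np.mean(x ** 2), np.mean(cleaned ** 2), cleaned[1000 - n0])
```

Because the impulse is uncorrelated with the delayed data, the predictor cannot anticipate it, so it survives in the prediction error while the predictable sinusoids are largely removed.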


8.5 LS COMPUTATIONS USING THE NORMAL EQUATIONS 


The solution of the normal equations for both MMSE and LSE estimation problems is computed by using the same algorithms. The key difference is that in MMSE estimation $\mathbf{R}$ and $\mathbf{d}$ are known, whereas in LSE estimation they need to be computed from the observed input and desired response signal samples. Therefore, it is natural to want to take advantage of the same algorithms developed for MMSE estimation in Chapter 7, whenever possible. However, keep in mind that despite algorithmic similarities, there are fundamental differences between the two classes of estimators that are dictated by the different nature of the criteria of performance (see Section 8.1). In this section, we show how the computational algorithms and structures developed for linear MMSE estimation can be applied to linear LSE estimation, relying heavily on the material presented in Chapter 7.


8.5.1 Linear LSE Estimation


The computation of a general linear LSE estimator requires the solution of a linear system
$$\hat{\mathbf{R}}\mathbf{c}_{ls} = \hat{\mathbf{d}} \tag{8.5.1}$$
where the time-average correlation matrix $\hat{\mathbf{R}}$ is Hermitian and positive definite [see (8.2.25)]. We can solve (8.5.1) by using the $LDL^H$ or the Cholesky decomposition introduced in Section 6.3. The computation of linear LSE estimators involves the steps summarized in Table 8.1. We again stress that the major computational effort is involved in the computation of $\hat{\mathbf{R}}$ and $\hat{\mathbf{d}}$.

Steps 2 and 3 in (6.3.16) can be facilitated by a single extended $LDL^H$ decomposition. To this end, we form the augmented data matrix
$$\bar{\mathbf{X}} = [\mathbf{X}\ \ \mathbf{y}] \tag{8.5.2}$$
and compute its time-average correlation matrix
$$\bar{\mathbf{R}} = \bar{\mathbf{X}}^H\bar{\mathbf{X}} = \begin{bmatrix} \mathbf{X}^H\mathbf{X} & \mathbf{X}^H\mathbf{y} \\ \mathbf{y}^H\mathbf{X} & \mathbf{y}^H\mathbf{y} \end{bmatrix} = \begin{bmatrix} \hat{\mathbf{R}} & \hat{\mathbf{d}} \\ \hat{\mathbf{d}}^H & E_y \end{bmatrix} \tag{8.5.3}$$

[Four time-domain panels over 950 to 1050 ns: "Impulse", "Impulse + White noise", "Impulse + White noise + NBI", and "After NBI excision: M = 100". Horizontal axis: Time (ns).]

FIGURE 8.8
NBI cancelation: time-domain results.

[Three panels: "Observed signal spectrum", "NBI canceler response: M = 100", and "Cleaned signal spectrum". Vertical axis: Power (dB). Horizontal axis: Frequency (GHz), 0 to 1.]

FIGURE 8.9
NBI cancelation: frequency-domain results.

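TABLE 8.1
Comparison between the $LDL^H$ and Cholesky decomposition methods for the solution of normal equations.

$$\begin{array}{clll}
\text{Step} & LDL^H \text{ decomposition} & \text{Cholesky decomposition} & \text{Description} \\ \hline
1 & \hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X},\ \hat{\mathbf{d}} = \mathbf{X}^H\mathbf{y} & \hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X},\ \hat{\mathbf{d}} = \mathbf{X}^H\mathbf{y} & \text{Normal equations } \hat{\mathbf{R}}\mathbf{c}_{ls} = \hat{\mathbf{d}} \\
2 & \hat{\mathbf{R}} = \mathbf{L}\mathbf{D}\mathbf{L}^H & \hat{\mathbf{R}} = \boldsymbol{\mathcal{L}}\boldsymbol{\mathcal{L}}^H & \text{Triangular decomposition} \\
3 & \mathbf{L}\mathbf{D}\mathbf{k} = \hat{\mathbf{d}} & \boldsymbol{\mathcal{L}}\tilde{\mathbf{k}} = \hat{\mathbf{d}} & \text{Forward substitution} \rightarrow \mathbf{k} \text{ or } \tilde{\mathbf{k}} \\
4 & \mathbf{L}^H\mathbf{c}_{ls} = \mathbf{k} & \boldsymbol{\mathcal{L}}^H\mathbf{c}_{ls} = \tilde{\mathbf{k}} & \text{Backward substitution} \rightarrow \mathbf{c}_{ls} \\
5 & E_{ls} = E_y - \mathbf{k}^H\mathbf{D}\mathbf{k} & E_{ls} = E_y - \tilde{\mathbf{k}}^H\tilde{\mathbf{k}} & \text{LSE computation} \\
6 & \mathbf{e}_{ls} = \mathbf{y} - \mathbf{X}\mathbf{c}_{ls} & \mathbf{e}_{ls} = \mathbf{y} - \mathbf{X}\mathbf{c}_{ls} & \text{Computation of residuals}
\end{array}$$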


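We then can show (see Problem 8.9) that the $LDL^H$ decomposition of $\bar{\mathbf{R}}$ is given by
$$\bar{\mathbf{R}} = \begin{bmatrix} \mathbf{L} & \mathbf{0} \\ \mathbf{k}^H & 1 \end{bmatrix}\begin{bmatrix} \mathbf{D} & \mathbf{0} \\ \mathbf{0}^H & E_{ls} \end{bmatrix}\begin{bmatrix} \mathbf{L}^H & \mathbf{k} \\ \mathbf{0}^H & 1 \end{bmatrix} \tag{8.5.4}$$
and thus provides the vector $\mathbf{k}$ and the LSE $E_{ls}$. Therefore, we can solve the normal equations (8.5.1), using the $LDL^H$ decomposition of $\bar{\mathbf{R}}$ to compute $\mathbf{L}$ and $\mathbf{k}$ and then solving $\mathbf{L}^H\mathbf{c}_{ls} = \mathbf{k}$ to compute $\mathbf{c}_{ls}$.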
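The same idea can be sketched with the Cholesky variant of Table 8.1, since NumPy provides a Cholesky routine but no $LDL^H$ routine. The following Python illustration (the function name `ls_via_augmented_cholesky` and the random test data are assumptions) factors the augmented correlation matrix once and reads $\tilde{\mathbf{k}}$ and $E_{ls}$ from the last row of the factor.

```python
import numpy as np

def ls_via_augmented_cholesky(X, y):
    """Solve the LS problem from one Cholesky factorization of Rbar = [X y]^H [X y]."""
    Xbar = np.column_stack([X, y])
    Rbar = Xbar.conj().T @ Xbar                     # augmented correlation matrix (8.5.3)
    Lc = np.linalg.cholesky(Rbar)                   # lower triangular, Rbar = Lc Lc^H
    M = X.shape[1]
    L_R = Lc[:M, :M]                                # Cholesky factor of R = X^H X
    k_tilde = Lc[M, :M].conj()                      # satisfies L_R k_tilde = d
    E_ls = abs(Lc[M, M]) ** 2                       # LS error E_y - k_tilde^H k_tilde
    # Backward substitution step; a dedicated triangular solver could be used instead
    c_ls = np.linalg.solve(L_R.conj().T, k_tilde)
    return c_ls, E_ls

rng = np.random.default_rng(8)
X = rng.standard_normal((40, 4)) + 1j * rng.standard_normal((40, 4))
y = rng.standard_normal(40) + 1j * rng.standard_normal(40)
c, E = ls_via_augmented_cholesky(X, y)
c_ref, *_ = np.linalg.lstsq(X, y, rcond=None)       # reference solution
print(np.allclose(c, c_ref), E)
```

The factorization requires the augmented matrix to be positive definite, which fails when $\mathbf{y}$ lies exactly in the column space of $\mathbf{X}$, that is, when $E_{ls} = 0$.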

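A careful inspection of the design equations for the general, $m$th-order, MMSE and LSE estimators, derived in Chapter 6 and summarized in Table 8.2, shows that the LSE equations can be obtained from the MMSE equations by replacing the linear operator $E\{\cdot\}$ by the linear operator $\sum_n(\cdot)$. As a result, all algorithms developed in Sections 7.1 and 7.2 can be used for linear LSE estimation problems.

TABLE 8.2
Comparison between the MMSE and LSE normal equations for general linear estimation.

$$\begin{array}{lll}
 & \text{MMSE} & \text{LSE} \\ \hline
\text{Available information} & \mathbf{R}_m(n),\ \mathbf{d}_m(n) & \{\mathbf{x}_m(n),\ y(n),\ N_i \le n \le N_f\} \\
\text{Normal equations} & \mathbf{R}_m(n)\mathbf{c}_m(n) = \mathbf{d}_m(n) & \hat{\mathbf{R}}_m\mathbf{c}_m = \hat{\mathbf{d}}_m \\
\text{Minimum error} & P_m(n) = P_y(n) - \mathbf{d}_m^H(n)\mathbf{c}_m(n) & E_m = E_y - \hat{\mathbf{d}}_m^H\mathbf{c}_m \\
\text{Correlation matrix} & \mathbf{R}_m(n) \triangleq E\{\mathbf{x}_m(n)\mathbf{x}_m^H(n)\} & \hat{\mathbf{R}}_m = \mathbf{X}_m^H\mathbf{X}_m = \sum_{n=0}^{N-1}\mathbf{x}_m(n)\mathbf{x}_m^H(n) \\
\text{Cross-correlation vector} & \mathbf{d}_m(n) \triangleq E\{\mathbf{x}_m(n)y^*(n)\} & \hat{\mathbf{d}}_m = \mathbf{X}_m^H\mathbf{y} = \sum_{n=0}^{N-1}\mathbf{x}_m(n)y^*(n) \\
\text{Power} & P_y(n) = E\{|y(n)|^2\} & E_y = \mathbf{y}^H\mathbf{y} = \sum_{n=0}^{N-1}|y(n)|^2
\end{array}$$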

For example, we can easily see that $\hat{R}_m$, $\hat{d}_m$, $L_m$, $D_m$, and $k_m$ have the optimum nesting property described in Section 7.1.1, that is, $\hat{R}_m = \hat{R}_{m+1}^{\lceil m \rceil}$ and so on. As a result, the factors of the $LDL^H$ decomposition have the optimum nesting property, and we can obtain an order-recursive structure for the computation of the LSE estimate $\hat{y}_m(n)$. Indeed, if we define

$$ w_m(n) = L_m^{-1} x_m(n) \qquad 0 \le n \le N-1 \tag{8.5.5} $$

then

$$ \hat{R}_m = \sum_{n=0}^{N-1} x_m(n)\, x_m^H(n) = L_m \left[ \sum_{n=0}^{N-1} w_m(n)\, w_m^H(n) \right] L_m^H = L_m D_m L_m^H \tag{8.5.6} $$

where the matrix $D_m$ is diagonal because the $LDL^H$ decomposition is unique. If we define the record vectors

$$ \tilde{w}_j \triangleq [w_j(0)\ \ w_j(1)\ \cdots\ w_j(N-1)]^T \tag{8.5.7} $$

and the data matrix

$$ W_m = [\tilde{w}_1\ \ \tilde{w}_2\ \cdots\ \tilde{w}_m] \tag{8.5.8} $$

then

$$ D_m = W_m^H W_m = \mathrm{diag}\{\xi_1, \xi_2, \ldots, \xi_m\} \tag{8.5.9} $$

where

$$ \xi_i = \sum_{n=0}^{N-1} |w_i(n)|^2 = \tilde{w}_i^H \tilde{w}_i \tag{8.5.10} $$

From (8.5.9), we have

$$ \tilde{w}_i^H \tilde{w}_j = 0 \qquad \text{for } i \ne j \tag{8.5.11} $$

that is, the columns of $W_m$ are orthogonal and, in this sense, are the innovation vectors of the columns of data matrix $X_m$, according to the LS interpretation of orthogonality introduced in Section 8.2.
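Following the approach in Section 7.1.5, we can show that the following order-recursive algorithm

$$ w_m(n) = x_m(n) - \sum_{i=1}^{m-1} l_{mi}\, w_i(n) $$
$$ \hat{y}_m(n) = \hat{y}_{m-1}(n) + k_m^*\, w_m(n) \tag{8.5.12} $$
or
$$ e_m(n) = e_{m-1}(n) - k_m^*\, w_m(n) $$

computed for $n = 0, 1, \ldots, N-1$ and $m = 1, 2, \ldots, M$, provides the LSE estimates for orders $1 \le m \le M$. Here $x_m(n)$ and $w_m(n)$ denote the $m$th components of the data and innovation vectors in (8.5.5), and $l_{mi}$ is the $(m, i)$ element of $L_M$.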

The statistical interpretations of innovation and partial correlation for $w_m(n)$ and $k_{m+1}$ now hold in a deterministic LSE sense. For example, the partial correlation between $y$ and $\tilde{x}_{m+1}$ is defined by using the residual records $\tilde{e}_m = y - X_m c_m$ and $\tilde{e}^b_m = \tilde{x}_{m+1} + X_m b_m$, where $b_m$ is the least-squares error BLP. Indeed, if $\beta_{m+1} \triangleq \tilde{e}^{bH}_m \tilde{e}_m$, we can show that $k_{m+1} = \beta_{m+1}/\xi_{m+1}$ (see Problem 8.11).
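The recursion (8.5.12) can be tried on the data used in Example 8.5.1. The following is a minimal Python sketch (the text itself uses MATLAB) for real-valued data. Computing the coefficients $l_{mi}$ and $k_m$ from inner products of the data records is an assumption made here for illustration, and all variable names are hypothetical.

```python
# Sketch of the order-recursive LSE computation (8.5.12), real-valued data.
# Assumption: l_mi and k_m are obtained from inner products of the records.

X = [[1.0, 1.0, 1.0],
     [2.0, 2.0, 1.0],
     [3.0, 1.0, 3.0],
     [1.0, 0.0, 1.0]]          # data matrix of Example 8.5.1 (N = 4, M = 3)
y = [1.0, 2.0, 4.0, 3.0]       # desired response


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


N, M = len(X), len(X[0])
cols = [[X[n][m] for n in range(N)] for m in range(M)]  # data records

W, xi, k = [], [], []          # innovation records, their energies, k_m
e = y[:]                       # residual record, starts as e_0 = y
E = [dot(y, y)]                # E_0 = E_y

for m in range(M):
    w = cols[m][:]
    for i in range(m):         # remove projections on earlier innovations
        l_mi = dot(W[i], cols[m]) / xi[i]
        w = [wn - l_mi * win for wn, win in zip(w, W[i])]
    W.append(w)
    xi.append(dot(w, w))       # diagonal element of D, see (8.5.10)
    k.append(dot(w, y) / xi[m])
    e = [en - k[m] * wn for en, wn in zip(e, w)]   # error update in (8.5.12)
    E.append(dot(e, e))        # LSE for order m + 1

print("xi  =", [round(v, 4) for v in xi])
print("k   =", [round(v, 4) for v in k])
print("E_m =", [round(v, 4) for v in E])
```

The energies `xi` should reproduce the diagonal of D found in Example 8.5.1, and the last entry of `E` its LSE; the intermediate entries are the errors of the lower-order estimators, obtained at no extra cost.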


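EXAMPLE 8.5.1. Solve the LS problem with the following data matrix and desired response signal:

$$ X = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 2 & 1 \\ 3 & 1 & 3 \\ 1 & 0 & 1 \end{bmatrix} \qquad y = \begin{bmatrix} 1 \\ 2 \\ 4 \\ 3 \end{bmatrix} $$

Solution. We start by computing the time-average correlation matrix and cross-correlation vector

$$ \hat{R} = \begin{bmatrix} 15 & 8 & 13 \\ 8 & 6 & 6 \\ 13 & 6 & 12 \end{bmatrix} \qquad \hat{d} = \begin{bmatrix} 20 \\ 9 \\ 18 \end{bmatrix} $$

followed by the $LDL^H$ decomposition of $\hat{R}$ using the MATLAB function [L,D]=ldlt(Rhat). This gives

$$ L = \begin{bmatrix} 1 & 0 & 0 \\ 0.5333 & 1 & 0 \\ 0.8667 & -0.5385 & 1 \end{bmatrix} \qquad D = \begin{bmatrix} 15 & 0 & 0 \\ 0 & 1.7333 & 0 \\ 0 & 0 & 0.2308 \end{bmatrix} $$

and working through the steps in Table 8.1, we find the LS solution and LSE to be

$$ c_{ls} = [3.0\ \ {-1.5}\ \ {-1.0}]^T \qquad E_{ls} = 1.5 $$

using the following sequence of MATLAB commands

    k=(L*D)\dhat;            % forward substitution, L*D*k = dhat
    cls=L'\k;                % backward substitution, L'*cls = k
    Els=sum((y-X*cls).^2);   % LSE from the residuals


These results can be verified by using the command cls=Rhat\dhat.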
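For readers without MATLAB, the six steps of Table 8.1 ($LDL^H$ column) can be sketched in plain Python for real-valued data. This is an illustrative sketch, not the routine used in the text: the function name `ldl` and the variable names are hypothetical, and no pivoting is performed, so a positive definite $\hat{R}$ is assumed.

```python
# Table 8.1 (LDL^H column) for real-valued data, standard library only.

def ldl(R):
    """Return unit lower triangular L and diagonal d with R = L diag(d) L^T."""
    M = len(R)
    L = [[float(i == j) for j in range(M)] for i in range(M)]
    d = [0.0] * M
    for j in range(M):
        d[j] = R[j][j] - sum(L[j][p] ** 2 * d[p] for p in range(j))
        for i in range(j + 1, M):
            acc = sum(L[i][p] * L[j][p] * d[p] for p in range(j))
            L[i][j] = (R[i][j] - acc) / d[j]
    return L, d


X = [[1.0, 1.0, 1.0], [2.0, 2.0, 1.0], [3.0, 1.0, 3.0], [1.0, 0.0, 1.0]]
y = [1.0, 2.0, 4.0, 3.0]
N, M = len(X), len(X[0])

# Step 1: time-average correlation matrix, cross-correlation vector, energy
Rhat = [[sum(X[n][i] * X[n][j] for n in range(N)) for j in range(M)]
        for i in range(M)]
dhat = [sum(X[n][i] * y[n] for n in range(N)) for i in range(M)]
Ey = sum(v * v for v in y)

# Step 2: triangular decomposition
L, D = ldl(Rhat)

# Step 3: forward substitution, L D k = dhat
k = [0.0] * M
for i in range(M):
    k[i] = (dhat[i] - sum(L[i][p] * D[p] * k[p] for p in range(i))) / D[i]

# Step 4: backward substitution, L^T c = k
c = [0.0] * M
for i in reversed(range(M)):
    c[i] = k[i] - sum(L[p][i] * c[p] for p in range(i + 1, M))

# Step 5: LSE without forming the residuals
Els = Ey - sum(D[i] * k[i] ** 2 for i in range(M))

# Step 6: residuals
e = [y[n] - sum(X[n][i] * c[i] for i in range(M)) for n in range(N)]

print("c_ls =", [round(v, 4) for v in c])
print("E_ls =", round(Els, 4))
```

Step 5 and the energy of the residuals from step 6 give the same number, which is a convenient consistency check.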


8.5.2 LSE FIR Filtering and Prediction 


As we stressed in Section 7.3, the fundamental difference between general linear estimation and FIR filtering and prediction, which is the key to the development of efficient order-recursive algorithms, is the shift invariance of the input data vector

$$ x_{m+1}(n) = [x(n)\ \ x(n-1)\ \cdots\ x(n-m+1)\ \ x(n-m)]^T \tag{8.5.13} $$

The input data vector can be partitioned as

$$ x_{m+1}(n) = \begin{bmatrix} x_m(n) \\ x(n-m) \end{bmatrix} = \begin{bmatrix} x(n) \\ x_m(n-1) \end{bmatrix} \tag{8.5.14} $$

which shows that samples from different times are incorporated as the order is increased. This creates a coupling between order and time updatings that has significant implications in the development of efficient algorithms. Indeed, we can easily see that the matrix

$$ \hat{R}_{m+1} = \sum_{n=N_i}^{N_f} x_{m+1}(n)\, x_{m+1}^H(n) \tag{8.5.15} $$

can be partitioned as

$$ \hat{R}_{m+1} = \begin{bmatrix} \hat{R}_m & \hat{r}^b_m \\ \hat{r}^{bH}_m & E^b_x \end{bmatrix} = \begin{bmatrix} E^f_x & \hat{r}^{fH}_m \\ \hat{r}^f_m & \hat{R}^f_m \end{bmatrix} \tag{8.5.16} $$

where the corner entries $E^b_x = \sum_n |x(n-m)|^2$ and $E^f_x = \sum_n |x(n)|^2$ are the energies of the appended samples over $N_i \le n \le N_f$, and

$$ \hat{R}^f_m = \hat{R}_m + x_m(N_i - 1)\, x_m^H(N_i - 1) - x_m(N_f)\, x_m^H(N_f) \tag{8.5.17} $$

is the matrix equivalent of (8.2.28). We notice that the relationship between $\hat{R}^f_m$ and $\hat{R}_m$, which allows for the development of a complete set of order-recursive algorithms for FIR filtering and prediction, depends on the choice of $N_i$ and $N_f$, that is, the windowing method selected.

As we discussed in Section 8.3, there are four cases of interest. In the full-windowing case ($N_i = 0$, $N_f = N + M - 2$), we have $\hat{R}^f_m = \hat{R}_m$ and $\hat{R}_m$ is Toeplitz. Therefore, all the algorithms and structures developed in Chapter 7 for Toeplitz matrices can be utilized.

In the prewindowing case ($N_i = 0$, $N_f = N - 1$), Equation (8.5.17) becomes

$$ \hat{R}^f_m = \hat{R}_m - x_m(N-1)\, x_m^H(N-1) \tag{8.5.18} $$

Since $x_m(n) = 0$ for $n < 0$ (prewindowing), $\hat{R}_m$ is a function of $N$. If we use the definition

$$ \hat{R}_m(N) = \sum_{n=0}^{N-1} x_m(n)\, x_m^H(n) \tag{8.5.19} $$

then the time-updating (8.5.18) can be written as

$$ \hat{R}^f_m = \hat{R}_m(N-1) = \hat{R}_m(N) - x_m(N-1)\, x_m^H(N-1) \tag{8.5.20} $$

and the order-updating (8.5.16) as

$$ \hat{R}_{m+1}(N) = \begin{bmatrix} \hat{R}_m(N) & \hat{r}^b_m(N) \\ \hat{r}^{bH}_m(N) & E^b_x(N) \end{bmatrix} = \begin{bmatrix} E^f_x(N) & \hat{r}^{fH}_m(N) \\ \hat{r}^f_m(N) & \hat{R}_m(N-1) \end{bmatrix} \tag{8.5.21} $$

with corner entries $E^b_x(N) = \sum_{n=0}^{N-1} |x(n-m)|^2$ and $E^f_x(N) = \sum_{n=0}^{N-1} |x(n)|^2$. This has the same form as (7.3.3). Therefore, all order recursions developed in Section 7.3 can be applied in the prewindowing case. However, to get a complete algorithm, we need recursions for the time updatings of the BLP, $b_m(N-1) \to b_m(N)$ and $E^b_m(N-1) \to E^b_m(N)$, which can be developed by using the time-recursive algorithms developed in Chapter 10 for LS adaptive filters. The postwindowing case can be developed in a similar fashion, but it is of no particular practical interest.

In the no-windowing case ($N_i = M - 1$, $N_f = N - 1$), matrices $\hat{R}_m$ and $\hat{R}^f_m$ depend on both $M$ and $N$. Thus, although the development of order recursions can be done as in the prewindowing case, the time updatings are more complicated due to (8.5.17) (Morf et al. 1977). Setting the lower limit to $N_i = M - 1$ means that all filters $c_m$, $1 \le m \le M$, are optimized over the interval $M - 1 \le n \le N - 1$, which makes the optimum nesting property possible. If we set $N_i = m - 1$, each filter $c_m$ is optimized over the interval $m - 1 \le n \le N - 1$; that is, it utilizes all the available data. However, in this case, the optimum nesting property $\hat{R}_m = \hat{R}_{m+1}^{\lceil m \rceil}$ does not hold, and the resulting order-recursive algorithms are slightly more complicated (Kalouptsidis et al. 1984).

The development of order-recursive algorithms for FBLP least-squares filters and predictors with linear phase constraints, for example, $c_m = \pm J c_m^*$, is more complicated, in general. A review of existing algorithms and more references can be found in Theodoridis and Kalouptsidis (1993).

In conclusion, we notice that order-recursive algorithms are more efficient than the $LDL^H$ decomposition-based solutions only if $N$ is much larger than $M$. Furthermore, their numerical properties are inferior to those of the $LDL^H$ decomposition methods; therefore, a bit of extra caution needs to be exercised when order-recursive algorithms are employed.


8.6 LS COMPUTATIONS USING ORTHOGONALIZATION TECHNIQUES


When we use the $LDL^H$ or Cholesky decomposition for the computation of LSE filters, we first must compute the time-average correlation matrix $\hat{R} = X^H X$ and the time-average cross-correlation vector $\hat{d} = X^H y$ from the data $X$ and $y$. Although this approach is widely used in practice, there are certain applications that require methods with better numerical properties. When numerical considerations are a major concern, the orthogonalization techniques, discussed in this section, and the singular value decomposition, discussed in Section 8.7, are the methods of choice for the solution of LS problems.

Orthogonal transformations are linear changes of variables that preserve length. In matrix notation

$$ y = Q^H x \tag{8.6.1} $$

where $Q$ is an orthogonal matrix, that is,

$$ Q^{-1} = Q^H \quad \Leftrightarrow \quad Q Q^H = I \tag{8.6.2} $$

From this property, we can easily see that

$$ \|y\|^2 = y^H y = x^H Q Q^H x = x^H x = \|x\|^2 \tag{8.6.3} $$

that is, multiplying a vector by an orthogonal matrix does not change the length of the vector.† As a result, algorithms that use orthogonal transformations do not amplify roundoff errors, resulting in more accurate numerical algorithms. There are two ways to look at the solution of LS problems using orthogonalization techniques:


• Use orthogonal matrices to transform the data matrix $X$ to a form that simplifies the solution of the normal equations without affecting the time-average correlation matrix $\hat{R} = X^H X$. For any orthogonal matrix $Q$, we have

$$ \hat{R} = X^H X = X^H Q Q^H X = (Q^H X)^H Q^H X \tag{8.6.4} $$

Clearly, we can repeat this process as many times as we wish until the matrix $X^H Q_1 Q_2 \cdots$ is in a form that simplifies the solution of the LS problem.

• Since orthogonal transformations preserve the length of a vector, multiplying the residual $e = y - Xc$ by an orthogonal matrix does not change the total squared error. Hence, multiplying the residuals by $Q^H$ gives

$$ \min_c \|e\| = \min_c \|y - Xc\| = \min_c \|Q^H (y - Xc)\| \tag{8.6.5} $$

Thus, the goal is to find a matrix $Q$ that simplifies the solution of the LS problem.
Suppose that we have already found an $N \times N$ orthogonal matrix $Q$ such that

$$ Q^H X = \begin{bmatrix} \mathcal{R} \\ 0 \end{bmatrix} \tag{8.6.6} $$

where, in practice, $Q$ is constructed to make the $M \times M$ matrix $\mathcal{R}$ upper triangular.‡ Using (8.6.5), we have

$$ \|e\| = \|Q^H e\| = \|Q^H y - Q^H X c\| \tag{8.6.7} $$

Using the partitioning

$$ Q \triangleq [Q_1\ \ Q_2] \tag{8.6.8} $$

where $Q_1$ has $M$ columns, we obtain

$$ X = Q_1 \mathcal{R} \tag{8.6.9} $$

which is known as the "thin" QR decomposition. Similarly,

$$ z \triangleq Q^H y = \begin{bmatrix} Q_1^H y \\ Q_2^H y \end{bmatrix} \triangleq \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \tag{8.6.10} $$

where $z_1$ has $M$ components and $z_2$ has $N - M$ components. Substitution of (8.6.9) and (8.6.10) into (8.6.7) gives

$$ \|e\| = \left\| \begin{bmatrix} \mathcal{R} c - z_1 \\ -z_2 \end{bmatrix} \right\| \tag{8.6.11} $$

Since the term $z_2 = Q_2^H y$ does not depend on the parameter vector $c$, the length $\|e\|$ becomes minimum if we set $c = c_{ls}$, that is,

$$ \mathcal{R} c_{ls} = z_1 \tag{8.6.12} $$

and

$$ E_{ls} = \|Q_2^H y\|^2 = \|z_2\|^2 \tag{8.6.13} $$

where the upper triangular system in (8.6.12) can be solved for $c_{ls}$ by back substitution. The steps for the solution of the LS problem using the QR decomposition are summarized in Table 8.3.

† Matrix $Q$ is an arbitrary unitary matrix and should not be confused with the eigenvector matrix of $R$.

‡ The symbol $U$ would be more appropriate for the upper triangular matrix $\mathcal{R}$, which can also be mistaken for the correlation matrix $\hat{R}$. However, we chose $\mathcal{R}$ because, otherwise, it would be difficult to use the well-established term QR factorization.

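TABLE 8.3
Solution of the LS problem using the QR decomposition method.

Step | Computations | Description
1 | $X = Q \begin{bmatrix} \mathcal{R} \\ 0 \end{bmatrix}$ | QR decomposition
2 | $z = Q^H y = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}$ | Transformation and partitioning of $y$
3 | $\mathcal{R} c_{ls} = z_1$ | Backward substitution → $c_{ls}$
4 | $E_{ls} = \|z_2\|^2$ | Computation of LS error
5 | $e_{ls} = Q \begin{bmatrix} 0 \\ z_2 \end{bmatrix}$ | Back transformation of residuals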


Using the QR decomposition (8.6.6), we have

$$ \hat{R} = X^H X = \mathcal{R}^H \mathcal{R} \tag{8.6.14} $$

which, in conjunction with the unique Cholesky decomposition $\hat{R} = \mathcal{L}\mathcal{L}^H$, gives

$$ \mathcal{R} = \mathcal{L}^H \tag{8.6.15} $$

that is, the QR factorization computes the Cholesky factor $\mathcal{R}$ directly from data matrix $X$. Also, since $\mathcal{L}^H c_{ls} = \tilde{k}$, we have

$$ \tilde{k} = z_1 \tag{8.6.16} $$

which, owing to the Cholesky decomposition, leads to

$$ E_{ls} = E_y - \tilde{k}^H \tilde{k} = \|z_2\|^2 \tag{8.6.17} $$

because $\|y\|^2 = \|Q^H y\|^2 = \|z_1\|^2 + \|z_2\|^2$.

If we form the augmented matrix

$$ \bar{X} = [X\ \ y] \tag{8.6.18} $$

the QR decomposition of $\bar{X}$ provides the triangular factor

$$ \bar{\mathcal{R}} = \begin{bmatrix} \mathcal{R} & \tilde{k} \\ 0^H & \tilde{z} \end{bmatrix} \tag{8.6.19} $$

which is identical to the one obtained from the Cholesky decomposition of $\bar{R} = \bar{X}^H \bar{X}$ with $\mathcal{R} = \mathcal{L}^H$ and $|\tilde{z}|^2 = E_{ls}$ (see Problem 8.14).


EXAMPLE 8.6.1. Solve the LS problem in Example 8.5.1

$$ X = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 2 & 1 \\ 3 & 1 & 3 \\ 1 & 0 & 1 \end{bmatrix} \qquad y = \begin{bmatrix} 1 \\ 2 \\ 4 \\ 3 \end{bmatrix} $$

using the QR decomposition approach.


Solution. Using the MATLAB function [Q,R]=qr(X), we obtain

$$ Q = \begin{bmatrix} -0.2582 & -0.3545 & 0.8006 & 0.4082 \\ -0.5164 & -0.7089 & -0.4804 & 0.0000 \\ -0.7746 & 0.4557 & 0.1601 & -0.4082 \\ -0.2582 & 0.4051 & -0.3203 & 0.8165 \end{bmatrix} \qquad \mathcal{R} = \begin{bmatrix} -3.8730 & -2.0656 & -3.3566 \\ 0 & -1.3166 & 0.7089 \\ 0 & 0 & 0.4804 \\ 0 & 0 & 0 \end{bmatrix} $$

and following the steps in Table 8.3, we find the LS solution and the LSE to be

$$ c_{ls} = [3.0\ \ {-1.5}\ \ {-1.0}]^T \qquad E_{ls} = 1.5 $$

using the sequence of MATLAB commands

    z=Q'*y;                     % transformation of y
    cls=R(1:3,1:3)\z(1:3);      % back substitution with the triangular block
    Els=sum(z(4).^2);           % energy of z2


In applications that require only the error (or residual) vector $e_{ls}$, we do not need to solve the triangular system $\mathcal{R} c_{ls} = z_1$. Instead, we can compute the error directly by $e_{ls} = Q \begin{bmatrix} 0 \\ z_2 \end{bmatrix}$ or the MATLAB command e=Q*[zeros(1,M) z2']'. This approach is known as direct error (or residual) extraction and plays an important role in LS adaptive filtering algorithms and architectures (see Chapter 10).
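The steps of Table 8.3, including direct residual extraction, can also be sketched in plain Python for real-valued data. The sketch below uses Householder reflections applied in place, so $Q$ is never formed explicitly; the function name `householder_ls` and its return values are assumptions made for illustration, and a full-rank $X$ is assumed.

```python
import math


def householder_ls(X, y):
    """LS solution by Householder QR (Table 8.3), real data, full-rank X."""
    N, M = len(X), len(X[0])
    A = [row[:] for row in X]      # overwritten by [R; 0]
    z = list(y)                    # overwritten by Q^T y
    reflectors = []
    for k in range(M):
        s = math.sqrt(sum(A[i][k] ** 2 for i in range(k, N)))
        w = [0.0] * N
        w[k] = A[k][k] + (s if A[k][k] >= 0 else -s)  # no cancellation
        for i in range(k + 1, N):
            w[i] = A[i][k]
        norm = math.sqrt(sum(v * v for v in w))
        if norm > 0.0:
            w = [v / norm for v in w]
        for j in range(k, M):      # apply H = I - 2 w w^T to the columns
            t = 2.0 * sum(w[i] * A[i][j] for i in range(N))
            for i in range(N):
                A[i][j] -= t * w[i]
        t = 2.0 * sum(w[i] * z[i] for i in range(N))
        z = [z[i] - t * w[i] for i in range(N)]
        reflectors.append(w)
    # Step 3: back substitution, R c = z1
    c = [0.0] * M
    for i in reversed(range(M)):
        c[i] = (z[i] - sum(A[i][j] * c[j] for j in range(i + 1, M))) / A[i][i]
    # Step 4: LS error from z2
    Els = sum(v * v for v in z[M:])
    # Step 5: residuals e = Q [0; z2], with Q = H_1 H_2 ... H_M
    e = [0.0] * M + z[M:]
    for w in reversed(reflectors):
        t = 2.0 * sum(w[i] * e[i] for i in range(N))
        e = [e[i] - t * w[i] for i in range(N)]
    return c, Els, e


X = [[1.0, 1.0, 1.0], [2.0, 2.0, 1.0], [3.0, 1.0, 3.0], [1.0, 0.0, 1.0]]
y = [1.0, 2.0, 4.0, 3.0]
c, Els, e = householder_ls(X, y)
print("c_ls =", [round(v, 4) for v in c])
print("E_ls =", round(Els, 4))
print("e_ls =", [round(v, 4) for v in e])
```

Note that the residual vector is obtained from $z_2$ alone; the solution $c_{ls}$ is not needed for it, which is the point of direct error extraction.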


It is generally agreed in numerical analysis that orthogonal decomposition methods applied directly to data matrix $X$ are preferable to the computation and solution of the normal equations whenever numerical stability is important (Hager 1988; Golub and Van Loan 1996). The sensitivity of the solution $c_{ls}$ to perturbations in the data $X$ and $y$ depends on the ratio of the largest to the smallest eigenvalues of $\hat{R}$ and does not depend on the algorithm used to compute the solution. Furthermore, the numerical accuracy required to compute $\mathcal{L}$ directly from $X$ is one-half of that required to compute $\mathcal{L}$ from $\hat{R}$. The "squaring" $\hat{R} = X^H X$ of the data to form the time-average correlation matrix results in a loss of information and should be avoided if the numerical precision is not deemed sufficient. Algorithms that compute $\mathcal{L}$ directly from $X$ are known as square root methods. However, by paraphrasing Rader (1996), we use the terms amplitude-domain techniques for methods that compute $\mathcal{L}$ directly from $X$ and power-domain techniques for methods that compute $\mathcal{L}$ indirectly from $\hat{R} = X^H X$. These ideas are illustrated in the following example.


EXAMPLE 8.6.2. Let

$$ X = \begin{bmatrix} 1 & 1 \\ \epsilon & 0 \\ 0 & \epsilon \end{bmatrix} \qquad \hat{R} = X^T X = \begin{bmatrix} 1 + \epsilon^2 & 1 \\ 1 & 1 + \epsilon^2 \end{bmatrix} $$

where $X^T X$ is clearly positive definite and nonsingular. Let the desired signal be $y = [2\ \ \epsilon\ \ \epsilon]^T$ so that $\hat{d} = [2 + \epsilon^2\ \ \ 2 + \epsilon^2]^T$. If $\epsilon$ is such that $1 + \epsilon^2 = 1$, due to limited numerical precision, the matrix $X^T X$ becomes singular. If we set $\epsilon = 10^{-8}$, solving the LS equations for $c_{ls}$ using the MATLAB command cls=Rhat\dhat is not possible since $\hat{R}$ is singular to the working precision of MATLAB. However, if the problem is solved using the QR decomposition as shown in Example 8.6.1, we find $c_{ls} = [1\ \ 1]^T$. Note that even for slightly larger values of $\epsilon$, the MATLAB command cls=Rhat\dhat finds a solution that differs from the true LS solution since $\hat{R}$ is ill conditioned.
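The effect described in this example is easy to reproduce in any IEEE double-precision environment. The following Python sketch forms $\hat{R}$ explicitly (the power-domain route) and then solves the same problem in the amplitude domain. As an assumption made here to keep the code short, the orthogonalization is done with the modified Gram-Schmidt procedure of Section 8.6.3 applied to the columns of $[X\ y]$ rather than with MATLAB's `qr`.

```python
import math

eps = 1e-8
X = [[1.0, 1.0], [eps, 0.0], [0.0, eps]]
y = [2.0, eps, eps]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


# Power-domain route: "squaring" the data loses eps^2 next to 1.
Rhat = [[sum(X[n][i] * X[n][j] for n in range(3)) for j in range(2)]
        for i in range(2)]
det = Rhat[0][0] * Rhat[1][1] - Rhat[0][1] * Rhat[1][0]
print("1 + eps^2 == 1 :", 1.0 + eps ** 2 == 1.0)
print("det(Rhat)      :", det)

# Amplitude-domain route: orthogonalize the columns of [X y] directly.
cols = [[X[n][m] for n in range(3)] for m in range(2)]
z = y[:]                       # y is treated as an extra column
R = [[0.0, 0.0], [0.0, 0.0]]
z1 = [0.0, 0.0]
for m in range(2):
    R[m][m] = math.sqrt(dot(cols[m], cols[m]))
    q = [v / R[m][m] for v in cols[m]]
    for i in range(m + 1, 2):
        R[m][i] = dot(q, cols[i])
        cols[i] = [a - R[m][i] * b for a, b in zip(cols[i], q)]
    z1[m] = dot(q, z)
    z = [a - z1[m] * b for a, b in zip(z, q)]

# Back substitution for the 2 x 2 triangular system R c = z1
c2 = z1[1] / R[1][1]
c1 = (z1[0] - R[0][1] * c2) / R[0][0]
print("c_ls from orthogonalization:", [c1, c2])
```

The computed $\hat{R}$ has an exactly zero determinant, so the normal equations cannot be solved, whereas working on $X$ directly recovers a solution close to $[1\ \ 1]^T$.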


There are two classes of orthogonal decomposition algorithms: 


1. Methods that compute the orthogonal matrix Q: Householder reflections and Givens 
rotations 
2. Methods that compute $Q_1$: classical and modified Gram-Schmidt orthogonalizations


These decompositions are illustrated in Figure 8.10. The cost of the QR decomposition using 
the Givens rotations is twice the cost of using Householder reflections or the Gram-Schmidt 
orthogonalization. The standard method for the computation of the QR decomposition and 
the solution of LS problems employs the Householder transformation. The Givens rotations 
are preferred for the implementation of adaptive LS filters (see Chapter 10). 


FIGURE 8.10
Pictorial illustration of the differences between thin and full QR decompositions.


8.6.1 Householder Reflections 


Consider a vector $x$ and a fixed line $l$ in the plane (see Figure 8.11). If we reflect $x$ about the line $l$, we obtain a vector $y$ that is the mirror image of $x$. Clearly, the vector $x$ and its reflection $y$ have the same length. We define a unit vector $w$ in the direction of $x - y$ as

$$ w \triangleq \frac{1}{\|x - y\|}\,(x - y) \tag{8.6.20} $$

assuming that $x$ and $y$ are nonzero vectors.

Since the projection of $x$ on $w$ is $(w^H x)\,w$, simple inspection of Figure 8.11 gives

$$ y = x - 2(w^H x)\,w = x - 2(w w^H)\,x = (I - 2 w w^H)\,x \triangleq H x $$

where

$$ H \triangleq I - 2 w w^H \tag{8.6.21} $$

In general, any matrix $H$ of the form (8.6.21) with $\|w\| = 1$ is known as a Householder reflection or Householder transformation (Householder 1958) and has the following properties

$$ H^H = H \qquad H^H H = I \qquad H^{-1} = H^H \tag{8.6.22} $$

that is, the matrix $H$ is unitary.

FIGURE 8.11
The Householder reflection vector.


We can build a Householder matrix $H_k$ that leaves intact the first $k - 1$ components of a given vector $x$, changes the $k$th component, and annihilates (zeros out) the remaining components, that is,

$$ y_i = (H_k x)_i = \begin{cases} x_i & i = 1, 2, \ldots, k-1 \\ y_k & i = k \\ 0 & i = k+1, \ldots, N \end{cases} \tag{8.6.23} $$

where $y_k$ is to be determined. If we set

$$ y_k = \pm \left( \sum_{i=k}^{N} |x_i|^2 \right)^{1/2} e^{j\theta_k} \tag{8.6.24} $$

where $\theta_k$ is the angle part of $x_k$ (if complex-valued), then both $x$ and $y$ have the same length. There are two choices for the sign of $y_k$. Since the computation of $w$ by (8.6.20) involves subtraction (which can lead to severe numerical problems when two numbers are nearly equal), we choose the negative sign so that $y_k$ and $x_k$ have opposite signs. Hence, $y_k - x_k$ is never the difference between nearly equal numbers. Therefore, using (8.6.20), we find that $w$ is given by

$$ w = \frac{1}{\sqrt{2 s_k (s_k + |x_k|)}} \begin{bmatrix} 0 \\ \vdots \\ 0 \\ (|x_k| + s_k)\, e^{j\theta_k} \\ x_{k+1} \\ \vdots \\ x_N \end{bmatrix} \tag{8.6.25} $$

where

$$ s_k \triangleq \left( \sum_{i=k}^{N} |x_i|^2 \right)^{1/2} \tag{8.6.26} $$

In general, an $N \times M$ matrix $X$ with $N > M$ can be triangularized with a sequence of $M$ Householder transformations

$$ H_M \cdots H_2 H_1 X = \mathcal{R} \tag{8.6.27} $$

or

$$ X = Q \mathcal{R} \tag{8.6.28} $$

where

$$ Q \triangleq H_1 H_2 \cdots H_M \tag{8.6.29} $$

Note that for $M = N$ we need only $M - 1$ reflections.

We next illustrate by an example how to compute the QR decomposition of a rectangular matrix by using a sequence of Householder transformations.


EXAMPLE 8.6.3. Find the QR decomposition of the data matrix

$$ X = \begin{bmatrix} 1 & 2 \\ 2 & 3 \\ 6 & 7 \end{bmatrix} $$

using Householder reflections.


Solution. Using (8.6.25), we compute the vector $w_1 = [0.7603\ \ 0.2054\ \ 0.6162]^T$ and the Householder reflection matrix $H_1$ for the first column of $X$. The modified data matrix is

$$ H_1 X = \begin{bmatrix} -6.4031 & -7.8087 \\ 0 & 0.3501 \\ 0 & -0.9496 \end{bmatrix} $$

Similarly, we compute the vector $w_2 = [0\ \ 0.8203\ \ {-0.5719}]^T$ and matrix $H_2$ for the second column of $H_1 X$, which results in the desired QR decomposition

$$ H_2 H_1 X = \mathcal{R} = \begin{bmatrix} -6.4031 & -7.8087 \\ 0 & -1.0121 \\ 0 & 0 \end{bmatrix} \qquad Q = H_1 H_2 = \begin{bmatrix} -0.1562 & -0.7711 & -0.6172 \\ -0.3123 & -0.5543 & 0.7715 \\ -0.9370 & 0.3133 & -0.1543 \end{bmatrix} $$

This result can be verified by using the MATLAB function [Q,R]=qr(X), which implements the Householder transformation.
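The numbers in this example can be reproduced with a few lines of Python. The sketch below is for real-valued data only, so the phase factor $e^{j\theta_k}$ in (8.6.25) reduces to the sign of $x_k$; the helper names `householder_vector` and `reflect` are hypothetical, and the column being reduced is assumed to be nonzero.

```python
import math


def householder_vector(x, k):
    """Unit vector w of (8.6.25) for a real vector x; k is a 0-based index."""
    s = math.sqrt(sum(v * v for v in x[k:]))          # s_k of (8.6.26)
    sign = 1.0 if x[k] >= 0 else -1.0                 # e^{j theta_k} for real x_k
    w = [0.0] * k + [sign * (abs(x[k]) + s)] + list(x[k + 1:])
    scale = math.sqrt(2.0 * s * (s + abs(x[k])))
    return [v / scale for v in w]


def reflect(w, A):
    """Return H A with H = I - 2 w w^T, without forming H explicitly."""
    N, M = len(A), len(A[0])
    out = [row[:] for row in A]
    for j in range(M):
        t = 2.0 * sum(w[i] * A[i][j] for i in range(N))
        for i in range(N):
            out[i][j] = A[i][j] - t * w[i]
    return out


X = [[1.0, 2.0], [2.0, 3.0], [6.0, 7.0]]
A = X
vectors = []
for k in range(2):
    w = householder_vector([row[k] for row in A], k)
    vectors.append(w)
    A = reflect(w, A)
    print("w%d =" % (k + 1), [round(v, 4) for v in w])
R = A
for row in R:
    print([round(v, 4) for v in row])
```

Applying each reflection as a rank-one update, as `reflect` does, avoids building the $N \times N$ matrices $H_k$ and is the usual way such transformations are implemented.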


8.6.2 The Givens Rotations 


The second elementary transformation that does not change the length of a vector is a rotation about an axis (see Figure 8.12). To describe the method of Givens, we assume for simplicity that the vectors are real-valued. The components of the rotated vector $y$ in terms of the components of the original vector $x$ are

$$ y_1 = r\cos(\phi + \theta) = x_1 \cos\theta - x_2 \sin\theta $$
$$ y_2 = r\sin(\phi + \theta) = x_1 \sin\theta + x_2 \cos\theta $$

or in matrix form

$$ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \triangleq G(\theta) \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{8.6.30} $$

where $\theta$ is the angle of rotation. We can easily show that the rotation matrix $G(\theta)$ in (8.6.30) is orthogonal and has a determinant $\det G(\theta) = 1$.

FIGURE 8.12
The Givens rotation.
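Any matrix of the form

$$ G_{ij}(\theta) \triangleq \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & -s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix} \tag{8.6.31} $$

in which the entries $c$ and $-s$ lie in row $i$ (columns $i$ and $j$) and the entries $s$ and $c$ lie in row $j$ (columns $i$ and $j$), with

$$ c^2 + s^2 = 1 \tag{8.6.32} $$

is known as a Givens rotation. When this matrix is applied to a vector $x$, it rotates the components $x_i$ and $x_j$ through an angle $\theta = \arctan(s/c)$ while leaving all other components intact (Givens 1958). Because of (8.6.30), we can write $c = \cos\theta$ and $s = \sin\theta$ for some angle $\theta$. It can easily be shown that the matrix $G_{ij}(\theta)$ is orthogonal.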

The Givens rotations have two attractive features. First, performing the rotation $y = G_{ij}(\theta)\,x$ as

$$ y_i = c\,x_i - s\,x_j $$
$$ y_j = s\,x_i + c\,x_j \tag{8.6.33} $$
$$ y_k = x_k \qquad k \ne i, j $$

requires only four multiplications and two additions. Second, we can choose $c$ and $s$ to annihilate the $j$th component of a vector. Indeed, if we set

$$ c = \frac{x_i}{\sqrt{x_i^2 + x_j^2}} \qquad s = -\frac{x_j}{\sqrt{x_i^2 + x_j^2}} \tag{8.6.34} $$

in (8.6.31), then

$$ y_i = \sqrt{x_i^2 + x_j^2} \qquad \text{and} \qquad y_j = 0 \tag{8.6.35} $$

Using a sequence of Givens rotations, we can annihilate (zero out) all elements of a matrix $X$ below the main diagonal to obtain the upper triangular matrix of the QR decomposition. The product of all the Givens rotation matrices provides matrix $Q$. We stress that the order of rotations cannot be arbitrary because later rotations can destroy zeros introduced earlier. A version of the Givens algorithm without square roots, which is known as the fast Givens QR, is discussed in Golub and Van Loan (1996).

We illustrate this procedure with the next example.

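EXAMPLE 8.6.4. We use Givens rotations to find the QR decomposition needed for the LS solution. Given the same data matrix $X$ as in Example 8.6.3

$$ X = \begin{bmatrix} 1 & 2 \\ 2 & 3 \\ 6 & 7 \end{bmatrix} $$

we first zero the last element of the first column, that is, element (3, 1), using the Givens rotation matrix $G_{31}$ with $c = -0.1644$ and $s = 0.9864$. These values follow from (8.6.34) with the signs of both $c$ and $s$ reversed, which still annihilates the element and reproduces the signs obtained in Example 8.6.3. We have

$$ G_{31} X = \begin{bmatrix} -6.0828 & -7.2336 \\ 2 & 3 \\ 0 & 0.8220 \end{bmatrix} $$

Then the element (2, 1) is eliminated by using the Givens rotation matrix $G_{21}$ with $c = 0.9500$ and $s = 0.3123$, resulting in

$$ G_{21} G_{31} X = \begin{bmatrix} -6.4031 & -7.8087 \\ 0 & 0.5905 \\ 0 & 0.8220 \end{bmatrix} $$

Finally, the QR factorization is found after applying the Givens rotation matrix $G_{32}$ with $c = -0.5834$ and $s = 0.8122$:

$$ \mathcal{R} = G_{32} G_{21} G_{31} X = \begin{bmatrix} -6.4031 & -7.8087 \\ 0 & -1.0121 \\ 0 & 0 \end{bmatrix} \qquad Q = G_{31}^T G_{21}^T G_{32}^T = \begin{bmatrix} -0.1562 & -0.7711 & -0.6172 \\ -0.3123 & -0.5543 & 0.7715 \\ -0.9370 & 0.3133 & -0.1543 \end{bmatrix} $$

which, as expected, agrees with the QR decomposition found in Example 8.6.3.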
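The same triangularization can be sketched in Python for real-valued data. The sketch below takes $c$ and $s$ exactly as in (8.6.34), so the rotated pivot elements come out positive; the resulting triangular factor therefore agrees with the one in the example only up to the signs of its rows, which is an inherent ambiguity of the QR decomposition. The helper names `givens` and `rotate_rows` are hypothetical.

```python
import math


def givens(xi, xj):
    """c and s of (8.6.34): the rotation maps (xi, xj) to (r, 0)."""
    r = math.hypot(xi, xj)
    if r == 0.0:
        return 1.0, 0.0            # nothing to annihilate
    return xi / r, -xj / r


def rotate_rows(A, i, j, c, s):
    """Apply G_ij of (8.6.31) to A in place; only rows i and j change."""
    for col in range(len(A[0])):
        ai, aj = A[i][col], A[j][col]
        A[i][col] = c * ai - s * aj    # (8.6.33)
        A[j][col] = s * ai + c * aj


X = [[1.0, 2.0], [2.0, 3.0], [6.0, 7.0]]
R = [row[:] for row in X]

# Zero elements (3,1), (2,1), (3,2) in this order (0-based row pairs).
for i, j in [(0, 2), (0, 1), (1, 2)]:
    c, s = givens(R[i][i], R[j][i])
    rotate_rows(R, i, j, c, s)

for row in R:
    print([round(v, 4) for v in row])
```

Each rotation touches only two rows, which is why Givens rotations suit problems where the data arrive one row at a time.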


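In the case of complex-valued vectors, the components of rotated vector $y$ in (8.6.30) are given by

$$ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} \cos\theta & -e^{-j\psi}\sin\theta \\ e^{j\psi}\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{8.6.36} $$

where $c = \cos\theta$ and $s = e^{j\psi}\sin\theta$. The element $-s$ of the rotation matrix $G_{ij}(\theta)$ is replaced by $-s^*$, where $c^2 + |s|^2 = 1$ instead of (8.6.32).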


8.6.3 Gram-Schmidt Orthogonalization 


If we are given a set of $M$ linearly independent vectors $x_1, x_2, \ldots, x_M$, we can create an orthonormal basis $q_1, q_2, \ldots, q_M$ that spans the same space by using a systematic procedure known as the classical Gram-Schmidt (GS) orthogonalization method (see also Section 7.2.4). The GS method starts by choosing

$$ q_1 = \frac{x_1}{\|x_1\|} \tag{8.6.37} $$

as the first basis vector. To obtain $q_2$, we express $x_2$ as the sum of two components: its projection $(q_1^H x_2)\,q_1$ onto $q_1$ and a vector $p_2$ that is perpendicular to $q_1$. Hence,

$$ p_2 = x_2 - (q_1^H x_2)\,q_1 \tag{8.6.38} $$

and $q_2$ is obtained by normalizing $p_2$, that is,

$$ q_2 = \frac{p_2}{\|p_2\|} \tag{8.6.39} $$

The vectors $q_1$ and $q_2$ have unit length, are orthonormal, and span the same space as $x_1$ and $x_2$. In general, the orthogonal basis vector $q_j$ is obtained by removing from $x_j$ its projections onto the already computed vectors $q_1$ to $q_{j-1}$. Therefore, we have

$$ p_j = x_j - \sum_{i=1}^{j-1} (q_i^H x_j)\,q_i \qquad \text{and} \qquad q_j = \frac{p_j}{\|p_j\|} \tag{8.6.40} $$

for all $1 \le j \le M$.
The GS algorithm can be used to obtain the "thin" $Q_1\mathcal{R}$ factorization. Indeed, if we define

$$ r_{ij} \triangleq q_i^H x_j \qquad r_{jj} \triangleq \|p_j\| \tag{8.6.41} $$

we have

$$ p_j = r_{jj}\,q_j = x_j - \sum_{i=1}^{j-1} r_{ij}\,q_i \tag{8.6.42} $$

or by solving for $x_j$

$$ x_j = \sum_{i=1}^{j} r_{ij}\,q_i \qquad j = 1, \ldots, M \tag{8.6.43} $$

Using matrix notation, we can express this relation as $X = Q_1\mathcal{R}$, which is exactly the thin $Q_1\mathcal{R}$ factorization in (8.6.9).

Major drawbacks of the GS procedure are that it does not produce accurate results and that the resulting basis may not be orthogonal when implemented using finite-precision arithmetic. However, we can achieve better numerical behavior if we reorganize the computations in a form known as the modified Gram-Schmidt (MGS) algorithm (Björck 1967). We start the first step by defining $q_1$ as before

$$ q_1 = \frac{x_1}{\|x_1\|} \tag{8.6.44} $$

However, all the remaining vectors $x_2, \ldots, x_M$ are modified to be orthogonal to $q_1$ by subtracting from each vector its projection onto $q_1$, that is,

$$ x_i^{(1)} = x_i - (q_1^H x_i)\,q_1 \qquad i = 2, \ldots, M \tag{8.6.45} $$

At the second step, we define the vector

$$ q_2 = \frac{x_2^{(1)}}{\|x_2^{(1)}\|} \tag{8.6.46} $$

which is already orthogonal to $q_1$. Then we modify the remaining vectors to make them orthogonal to $q_2$

$$ x_i^{(2)} = x_i^{(1)} - (q_2^H x_i^{(1)})\,q_2 \qquad i = 3, \ldots, M \tag{8.6.47} $$

Continuing in a similar manner, we compute $q_m$ and the updated vectors $x_i^{(m)}$ by

$$ q_m = \frac{x_m^{(m-1)}}{\|x_m^{(m-1)}\|} \tag{8.6.48} $$

and

$$ x_i^{(m)} = x_i^{(m-1)} - (q_m^H x_i^{(m-1)})\,q_m \qquad i = m+1, \ldots, M \tag{8.6.49} $$

The MGS algorithm involves the steps outlined in Table 8.4 and is implemented by the function Q=mgs(X). The superior numerical properties of the modified algorithm stem from the fact that successive $x_i^{(m)}$ generated by (8.6.49) decrease in size and that the dot product $q_m^H x_i^{(m-1)}$ can be computed more accurately than the dot product $q_m^H x_i$.


TABLE 8.4
Orthogonalization of a set of vectors using the modified Gram-Schmidt algorithm.

Modified GS Algorithm

For m = 1 to M
    r_mm = ||x_m||
    q_m = x_m / r_mm
    For i = m + 1 to M
        r_mi = q_m^H x_i
        x_i ← x_i − r_mi q_m
    next i
next m


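EXAMPLE 8.6.5. Consider an LS problem (Dahlquist and Björck 1974) with

$$ X = \begin{bmatrix} 1 & 1 & 1 \\ \epsilon & 0 & 0 \\ 0 & \epsilon & 0 \\ 0 & 0 & \epsilon \end{bmatrix} \qquad y = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} $$

where $\epsilon^2 \ll 1$, that is, $\epsilon^2$ can be neglected compared to 1. We first compute $X^T X$ and $X^T y$ to determine the normal equations

$$ \begin{bmatrix} 1 + \epsilon^2 & 1 & 1 \\ 1 & 1 + \epsilon^2 & 1 \\ 1 & 1 & 1 + \epsilon^2 \end{bmatrix} c_{ls} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} $$

which provide the exact solution $c_{ls} = [1\ \ 1\ \ 1]^T / (3 + \epsilon^2)$. Numerically, the matrix $X^T X$ is singular on any computer with accuracy such that $1 + \epsilon^2$ is rounded to 1. Applying the MGS algorithm to the column vectors of the augmented matrix $[X\ \ y]$, and taking into consideration that $1 + \epsilon^2$ is rounded to 1, we obtain the factors $Q_1$, $\mathcal{R}$, and $z_1$ of the thin QR decomposition. Solving $\mathcal{R} c_{ls} = z_1$, we obtain $c_{ls} = [1\ \ 1\ \ 1]^T / 3$, which agrees with the exact solution under the assumption that $1 + \epsilon^2$ is rounded to 1.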
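The difference between the classical and modified procedures shows up clearly on the data matrix of this example. The Python sketch below (real-valued data, $\epsilon = 10^{-8}$ chosen here so that $1 + \epsilon^2$ rounds to 1 in double precision) orthogonalizes the three columns both ways and compares the inner product of the last two basis vectors, which should be zero. The function names `cgs` and `mgs` are illustrative and are not the book's Q=mgs(X) routine.

```python
import math


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def cgs(cols):
    """Classical GS (8.6.40): project the ORIGINAL x_j on q_1, ..., q_{j-1}."""
    Q = []
    for x in cols:
        p = x[:]
        for q in Q:
            r = dot(q, x)
            p = [pi - r * qi for pi, qi in zip(p, q)]
        nrm = math.sqrt(dot(p, p))
        Q.append([pi / nrm for pi in p])
    return Q


def mgs(cols):
    """Modified GS (Table 8.4): update all remaining vectors at every step."""
    V = [x[:] for x in cols]
    Q = []
    for m in range(len(V)):
        nrm = math.sqrt(dot(V[m], V[m]))
        q = [v / nrm for v in V[m]]
        Q.append(q)
        for i in range(m + 1, len(V)):
            r = dot(q, V[i])
            V[i] = [vi - r * qi for vi, qi in zip(V[i], q)]
    return Q


eps = 1e-8
cols = [[1.0, eps, 0.0, 0.0],
        [1.0, 0.0, eps, 0.0],
        [1.0, 0.0, 0.0, eps]]          # columns of X

Qc, Qm = cgs(cols), mgs(cols)
print("classical GS: q2.q3 =", dot(Qc[1], Qc[2]))
print("modified  GS: q2.q3 =", dot(Qm[1], Qm[2]))
```

With the classical procedure the second and third basis vectors are far from orthogonal, while the modified procedure keeps their inner product at the level of roundoff.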


8.7 LS COMPUTATIONS USING THE SINGULAR VALUE DECOMPOSITION 


The singular value decomposition (SVD) plays a prominent role in the theoretical analysis 
and practical solution of LS problems because (1) it provides a unified framework for the 
solution of overdetermined and underdetermined LS problems with full rank or that are 
rank-deficient and (2) it is the best numerical method to solve LS problems in practice. In 
this section, we discuss the existence and fundamental properties of the SVD, show how 
to use it for solving the LS problem, and apply it to determine the numerical rank of a 
matrix. More details are given in Golub and Van Loan (1996), Leon (1990), Stewart (1973), 
Watkins (1991), and Klema and Laub (1980). 


8.7.1 Singular Value Decomposition 


The eigenvalue decomposition reduces a Hermitian matrix to a diagonal matrix by premultiplying and postmultiplying it by a single unitary matrix. The singular value decomposition, introduced in the next theorem, reduces a general matrix to a diagonal one by premultiplying and postmultiplying it by two different unitary matrices.

THEOREM 8.2. Any real $N \times M$ matrix $X$ with rank $r$ (recall that $r$ is defined as the number of linearly independent columns of a matrix) can be written as

$$ X = U \Sigma V^H \tag{8.7.1} $$

where $U$ is an $N \times N$ unitary matrix, $V$ is an $M \times M$ unitary matrix, and $\Sigma$ is an $N \times M$ matrix with $(\Sigma)_{ij} = 0$, $i \ne j$, and $(\Sigma)_{ii} = \sigma_i > 0$, $i = 1, 2, \ldots, r$. The numbers $\sigma_i$ are known as the singular values of $X$ and are usually arranged in decreasing order as $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$.


Proof. We follow the derivation given in Stewart (1973). Since the matrix $X^H X$ is positive semidefinite, it has nonnegative eigenvalues $\sigma_1^2, \sigma_2^2, \ldots, \sigma_M^2$ such that $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0 = \sigma_{r+1} = \cdots = \sigma_M$ for $0 \le r \le M$. Let $v_1, v_2, \ldots, v_M$ be the eigenvectors corresponding to the eigenvalues $\sigma_1^2, \sigma_2^2, \ldots, \sigma_M^2$. Consider the partitioning $V = [V_1\ \ V_2]$, where $V_1$ consists of the first $r$ columns of $V$. If $\Sigma_r = \mathrm{diag}\{\sigma_1, \sigma_2, \ldots, \sigma_r\}$, then we obtain $V_1^H X^H X V_1 = \Sigma_r^2$ and

$$ \Sigma_r^{-1} V_1^H X^H X V_1 \Sigma_r^{-1} = I \tag{8.7.2} $$

Since $V_2^H X^H X V_2 = 0$, we have

$$ X V_2 = 0 \tag{8.7.3} $$

If we define

$$ U_1 \triangleq X V_1 \Sigma_r^{-1} \tag{8.7.4} $$

then (8.7.2) gives $U_1^H U_1 = I$; that is, the columns of $U_1$ are orthonormal. A unitary matrix $U \triangleq [U_1\ \ U_2]$ is found by properly choosing the components of $U_2$, that is, $U_2^H U_1 = 0$ and $U_2^H U_2 = I$. Then

$$ U^H X V = \begin{bmatrix} U_1^H \\ U_2^H \end{bmatrix} X\, [V_1\ \ V_2] = \begin{bmatrix} U_1^H X V_1 & U_1^H (X V_2) \\ U_2^H X V_1 & U_2^H (X V_2) \end{bmatrix} = \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} \tag{8.7.5} $$

because of (8.7.2), (8.7.3), and $U_2^H X V_1 = (U_2^H U_1)\,\Sigma_r = 0$.

The SVD of a matrix, which is illustrated in Figure 8.13, provides a wealth of information about the structure of the matrix. Figure 8.14 provides a geometric interpretation of the SVD of a $2 \times 2$ matrix $X$ (see Problem 8.23 for details).

FIGURE 8.13
Pictorial representation of the singular value decomposition of a matrix.

FIGURE 8.14
The SVD of a $2 \times 2$ matrix maps the unit circle into an ellipse whose semimajor and semiminor axes are equal to the singular values of the matrix. [Figure not reproduced. Stages: rotation by $V^H = V^{-1}$; stretching by $\Sigma = \mathrm{diag}\{\sigma_1, \sigma_2\}$; rotation by $U$.]

Properties and interpretations. We next provide a summary of interpretations and properties whose proofs are given in the references and the problems.

1. Postmultiplying (8.7.1) by $V$ and equating columns, we obtain

$$ X v_i = \begin{cases} \sigma_i u_i & i = 1, 2, \ldots, r \\ 0 & i = r+1, \ldots, M \end{cases} \tag{8.7.6} $$

that is, $v_i$ (columns of $V$) are the right singular vectors of $X$.

2. Premultiplying (8.7.1) by $U^H$ and equating rows, we obtain

$$ u_i^H X = \begin{cases} \sigma_i v_i^H & i = 1, 2, \ldots, r \\ 0 & i = r+1, \ldots, N \end{cases} \tag{8.7.7} $$

that is, $u_i$ (columns of $U$) are the left singular vectors of $X$.

3. Let $\lambda_i(\cdot)$ and $\sigma_i(\cdot)$ denote the $i$th largest eigenvalue and singular value of a given matrix, respectively. The vectors $v_1, \ldots, v_M$ are eigenvectors of $X^H X$; $u_1, \ldots, u_N$ are eigenvectors of $X X^H$, for which the squares of the singular values $\sigma_1^2, \ldots, \sigma_r^2$ of $X$ are the first $r$ nonzero eigenvalues of $X^H X$ and $X X^H$, that is,

$$ \lambda_i(X^H X) = \lambda_i(X X^H) = \sigma_i^2(X) \tag{8.7.8} $$

4. In the product $X = U \Sigma V^H$, the last $N - r$ columns of $U$ and $M - r$ columns of $V$ are superfluous because they interact only with blocks of zeros in $\Sigma$. This leads to the following thin SVD representation of $X$

$$ X = U_r \Sigma_r V_r^H \tag{8.7.9} $$

where $U_r$ and $V_r$ consist of the first $r$ columns of $U$ and $V$, respectively, and $\Sigma_r = \mathrm{diag}\{\sigma_1, \sigma_2, \ldots, \sigma_r\}$.

5. The SVD can be expressed as

$$ X = \sum_{i=1}^{r} \sigma_i u_i v_i^H \tag{8.7.10} $$

that is, as a sum of cross products weighted by the singular values.

6. If the matrix $X$ has rank $r$, then:

a. The first $r$ columns of $U$ form an orthonormal basis for the space spanned by the columns of $X$ (range space or column space of $X$).

b. The first $r$ columns of $V$ form an orthonormal basis for the space spanned by the rows of $X$ (range space of $X^H$ or row space of $X$).

c. The last $M - r$ columns of $V$ form an orthonormal basis for the space of vectors orthogonal to the rows of $X$ (null space of $X$).

d. The last $N - r$ columns of $U$ form an orthonormal basis for the null space of $X^H$.

7. The Euclidean norm of $X$ is

$$ \|X\| = \sigma_1 \tag{8.7.11} $$

8. The Frobenius norm of $X$, that is, the square root of the sum of the squares of its elements, is

$$ \|X\|_F = \sqrt{\sum_{i=1}^{N} \sum_{j=1}^{M} |x_{ij}|^2} = \sqrt{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2} \tag{8.7.12} $$

9. The difference between the transformations implied by the eigenvalue decomposition and the SVD can be summarized as follows:

Eigenvalue decomposition, $R = Q \Lambda Q^H$: the matrix $R$ maps each eigenvector onto a scaled version of itself, $R\,q_i = \lambda_i q_i$ for $i = 1, \ldots, M$.

SVD, $X = U \Sigma V^H$: the matrix $X$ maps right singular vectors onto scaled left singular vectors, $X v_i = \sigma_i u_i$ for $i = 1, \ldots, r$ and $X v_i = 0$ for $i = r+1, \ldots, M$, whereas $X^H$ maps left singular vectors onto scaled right singular vectors, $X^H u_i = \sigma_i v_i$ for $i = 1, \ldots, r$ and $X^H u_i = 0$ for $i = r+1, \ldots, N$.

This illustrates the need for left and right singular values and vectors.


We can compute the SVD of a matrix $X$ by forming the matrices $X^H X$ and $X X^H$ and computing their eigenvalues and eigenvectors (see Problem 8.21). However, we should avoid this approach because the "squaring" of $X$ to form these correlation matrices results in a loss of information (see Example 8.6.2).

In practice, the SVD is computed by using the algorithm of Golub and Reinsch (1970) or the R-SVD algorithm described in Chan (1982), which for $N \gg M$ is twice as fast. The state of the art in SVD research is provided in Golub and Van Loan (1996), whereas reliable numerical algorithms and code are given in LAPACK, LINPACK, and Numerical Recipes in C (Press et al. 1992).


8.7.2 Solution of the LS Problem 


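So far, we have discussed the solution of the overdetermined (N > M) LS problem with 
full-rank (r = M) data matrices using the normal equations and the QR decomposition 
techniques. We next show how the SVD can be used to solve the LS problem without 
making any assumptions about the dimensions N and M or the rank r of data matrix X. 

Suppose that we know the exact SVD of data matrix $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H$. Since U is unitary, 
multiplication by $\mathbf{U}^H$ does not change the Euclidean norm, and 
$$\|\mathbf{y} - \mathbf{X}\mathbf{c}\| = \|\mathbf{y} - \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H\mathbf{c}\| = \|\mathbf{U}^H\mathbf{y} - \boldsymbol{\Sigma}\mathbf{V}^H\mathbf{c}\| \qquad (8.7.13)$$
If we define $\mathbf{y}' \triangleq \mathbf{U}^H\mathbf{y}$ and $\mathbf{c}' \triangleq \mathbf{V}^H\mathbf{c}$, we obtain the LSE 
$$\|\mathbf{y} - \mathbf{X}\mathbf{c}\|^2 = \|\mathbf{y}' - \boldsymbol{\Sigma}\mathbf{c}'\|^2 = \sum_{i=1}^{r}|y_i' - \sigma_i c_i'|^2 + \sum_{i=r+1}^{N}|y_i'|^2 \qquad (8.7.14)$$
which is minimized if and only if $c_i' = y_i'/\sigma_i$ for i = 1, 2, ..., r. We notice that when 
r < M, the terms $c_{r+1}', \ldots, c_M'$ do not appear in (8.7.14). Therefore, they have no effect 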


on the residual and can be chosen arbitrarily. To illustrate this point, consider the geometric 
interpretation in Figure 8.5. There is only one linear combination of the linearly independent 
vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ that determines the optimum LS estimate. If the data matrix has one more 
column $\mathbf{x}_3$ that lies in the same plane, then there are an infinite number of linear combinations 
$c_1\mathbf{x}_1 + c_2\mathbf{x}_2 + c_3\mathbf{x}_3$ that satisfy the LSE criterion. To obtain a unique LS solution from all 
solutions c that minimize $\|\mathbf{y} - \mathbf{X}\mathbf{c}\|$, we choose the one with the minimum length $\|\mathbf{c}\|$. Since 
the matrix V is unitary, we have $\|\mathbf{c}'\| = \|\mathbf{V}^H\mathbf{c}\| = \|\mathbf{c}\|$, and the norm $\|\mathbf{c}\|$ is minimized 
when the norm $\|\mathbf{c}'\|$ is minimized. Hence, choosing $c_{r+1}' = \cdots = c_M' = 0$ provides 
the minimum-norm solution to the LS problem. In summary, the unique, minimum-norm 
solution to the LS problem is 
$$\mathbf{c}_{ls} = \sum_{i=1}^{r}\frac{\mathbf{u}_i^H\mathbf{y}}{\sigma_i}\,\mathbf{v}_i \qquad (8.7.15)$$
where 
$$c_i' = \begin{cases} \dfrac{y_i'}{\sigma_i} = \dfrac{\mathbf{u}_i^H\mathbf{y}}{\sigma_i} & i = 1, \ldots, r \\ 0 & i = r+1, \ldots, M \end{cases} \qquad (8.7.16)$$
and 
$$E_{ls} = \|\mathbf{y} - \mathbf{X}\mathbf{c}_{ls}\|^2 = \sum_{i=r+1}^{N}|y_i'|^2 = \sum_{i=r+1}^{N}|\mathbf{u}_i^H\mathbf{y}|^2 \qquad (8.7.17)$$
is the corresponding LS error. 

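We next express the unique minimum-norm solution to the LS problem in terms of the 
pseudoinverse of data matrix X using the SVD. To this end, we note that (8.7.16) can be 
written in matrix form 
$$\mathbf{c}_{ls}' = \boldsymbol{\Sigma}^{+}\mathbf{y}' \qquad (8.7.18)$$
where 
$$\boldsymbol{\Sigma}^{+} \triangleq \begin{bmatrix} \boldsymbol{\Sigma}_r^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \qquad (8.7.19)$$
is an M × N matrix with $\boldsymbol{\Sigma}_r^{-1} = \mathrm{diag}\{1/\sigma_1, \ldots, 1/\sigma_r\}$. Therefore, using (8.7.15) and 
(8.7.19), we obtain 
$$\mathbf{c}_{ls} = \mathbf{V}\boldsymbol{\Sigma}^{+}\mathbf{U}^H\mathbf{y} = \mathbf{X}^{+}\mathbf{y} \qquad (8.7.20)$$
where 
$$\mathbf{X}^{+} \triangleq \mathbf{V}\boldsymbol{\Sigma}^{+}\mathbf{U}^H = \sum_{i=1}^{r}\frac{1}{\sigma_i}\mathbf{v}_i\mathbf{u}_i^H \qquad (8.7.21)$$
is the pseudoinverse of matrix X. For full-rank matrices, the pseudoinverse is defined as 
$\mathbf{X}^{+} = (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H$ (Golub and Van Loan 1996), so that using (8.7.21) leads to the LS 
solution in (8.2.21). If N = M = rank(X), then $\mathbf{X}^{+} = \mathbf{X}^{-1}$. Therefore, (8.7.21) holds for 
any rectangular or square matrix that is either full rank or rank-deficient. Formally, $\mathbf{X}^{+}$ can 
be defined independently of the LS problem as the unique M × N matrix A that satisfies 
the four Moore-Penrose conditions 
$$\mathbf{X}\mathbf{A}\mathbf{X} = \mathbf{X} \qquad (\mathbf{X}\mathbf{A})^H = \mathbf{X}\mathbf{A} \qquad \mathbf{A}\mathbf{X}\mathbf{A} = \mathbf{A} \qquad (\mathbf{A}\mathbf{X})^H = \mathbf{A}\mathbf{X} \qquad (8.7.22)$$
which implies that $\mathbf{X}\mathbf{X}^{+}$ and $\mathbf{X}^{+}\mathbf{X}$ are orthogonal projections onto the range space of X and 
$\mathbf{X}^H$, respectively (see Problem 8.25). However, we stress that the pseudoinverse is, for the most part, a 
theoretical tool, and there is seldom any reason for its use in practice. 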
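
The four Moore-Penrose conditions can be checked numerically. The following is a minimal sketch, assuming an arbitrary rank-1 test matrix chosen only for illustration and MATLAB's built-in pinv function; it is not part of the original example set.

```matlab
% Numerical check of the four Moore-Penrose conditions (8.7.22).
% The test matrix is an arbitrary rank-deficient example (assumption).
X = [1 2; 2 4; 3 6];            % second column = 2 * first column, rank 1
A = pinv(X);                    % pseudoinverse computed via the SVD

e1 = norm(X*A*X - X, 'fro');    % XAX = X
e2 = norm(A*X*A - A, 'fro');    % AXA = A
e3 = norm((X*A)' - X*A, 'fro'); % XA is Hermitian
e4 = norm((A*X)' - A*X, 'fro'); % AX is Hermitian

P  = X*A;                       % projector onto the range space of X
e5 = norm(P*P - P, 'fro');      % a projector is idempotent

fprintf('%g %g %g %g %g\n', e1, e2, e3, e4, e5);
```

All five residuals should be at the level of the machine precision.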

In summary, the computation of the LS estimator using the SVD involves the steps 
shown in Table 8.5. The vector $\mathbf{c}_{ls}$ is unique and satisfies two requirements: (1) It minimizes 
the sum of the squared errors, and (2) it has the smallest Euclidean norm. 
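
The SVD-based procedure handles rank-deficient data without modification. The sketch below is an illustration under stated assumptions: the data matrix and desired response are made up, and the rank tolerance follows the rule used by MATLAB's rank function rather than anything prescribed in the text.

```matlab
% Minimum-norm LS solution via the SVD for a rank-deficient data matrix.
% X and y are illustrative values (assumption), not data from the text.
X = [1 0 1; 0 1 1; 1 1 2; 1 -1 0];   % third column = first + second, rank 2
y = [1; 2; 3; 4];
[N, M] = size(X);

[U, S, V] = svd(X);                  % full SVD
s   = diag(S);                       % singular values
tol = max(N, M) * eps(max(s));       % assumed tolerance for "zero"
r   = sum(s > tol);                  % numerical rank

yp  = U' * y;                        % transformed desired response
cls = V(:, 1:r) * (yp(1:r) ./ s(1:r));   % minimum-norm solution
Els = sum(abs(yp(r+1:N)).^2);        % LS error

fprintf('rank = %d, Els = %g\n', r, Els);
```

Restricting the sum to the first r singular values is what sets the free components of the transformed coefficient vector to zero, so the result should agree with pinv(X)*y.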

The following example illustrates the use of the SVD for the computation of the LS 
estimator. 


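EXAMPLE 8.7.1. Solve the LS problem with the following data matrix and desired response 
signal: 
$$\mathbf{X} = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 2 & 1 \\ 3 & 1 & 3 \\ 1 & 0 & 1 \end{bmatrix} \qquad \mathbf{y} = \begin{bmatrix} 1 \\ 2 \\ 4 \\ 3 \end{bmatrix}$$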
TABLE 8.5 
Solution of the LS problem using the SVD method. 

Step   Description 
1      Compute the SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H$ 
2      Determine the rank r of X 
3      Compute $y_i' = \mathbf{u}_i^H\mathbf{y}$, i = 1, ..., N 
4      Compute $\mathbf{c}_{ls} = \sum_{i=1}^{r}\dfrac{y_i'}{\sigma_i}\mathbf{v}_i$ 
5      Compute $E_{ls} = \sum_{i=r+1}^{N}|y_i'|^2$ 

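Solution. We start by computing the SVD of $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H$ by using the MATLAB function 
[U,S,V]=svd(X). This gives 
$$\mathbf{U} = \begin{bmatrix} 0.3041 & 0.2170 & 0.8329 & 0.4082 \\ 0.4983 & 0.7771 & -0.3844 & 0.0000 \\ 0.7768 & -0.4778 & 0.0409 & -0.4082 \\ 0.2363 & -0.3474 & -0.3960 & 0.8165 \end{bmatrix}$$
$$\boldsymbol{\Sigma} = \begin{bmatrix} 5.5338 & 0 & 0 \\ 0 & 1.5139 & 0 \\ 0 & 0 & 0.2924 \\ 0 & 0 & 0 \end{bmatrix} \qquad \mathbf{V} = \begin{bmatrix} 0.6989 & -0.0063 & -0.7152 \\ 0.3754 & 0.8544 & 0.3593 \\ 0.6088 & -0.5196 & 0.5994 \end{bmatrix}$$
which implies that the data matrix has rank r = 3. Next we compute 
$$\mathbf{y}' = \mathbf{U}^H\mathbf{y} = \begin{bmatrix} 5.1167 \\ -1.1821 \\ -0.9602 \\ 1.2247 \end{bmatrix} \qquad \mathbf{c}_{ls} = \begin{bmatrix} 3.0 \\ -1.5 \\ -1.0 \end{bmatrix} \qquad E_{ls} = 1.5$$
by the MATLAB commands 

    yp=U'*y; 
    cls=V*(yp(1:r)./diag(S)); 
    Els=sum(yp(r+1:N).^2); 

which implement steps 3, 4, and 5 in Table 8.5. The LS solution also can be obtained from 
cls=X\y. If we set $(\mathbf{X})_{23} = 2$, the first and last columns of X become linearly dependent, the 
SVD has only two nonzero singular values, and the backslash operator warns that X is rank-deficient. 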


Table 8.6 shows the numerical operations required by the various LS solution methods 
(Golub and Van Loan 1996). For full-rank (nonsingular) data matrices, all other methods 
are simpler than the SVD. However, these methods are inaccurate when X is rank-deficient 
(nearly singular). In such cases, the SVD reveals the near singularity of the data matrix and 
is the method of choice because it provides a reliable computation of the numerical rank 
(see the next section). 


TABLE 8.6 
Computational complexity of LS computation algorithms. 

LS Algorithm                     FLOPS (floating point operations) 
Normal equations                 $NM^2 + M^3/3$ 
Householder orthogonalization    $2NM^2 - 2M^3/3$ 
Givens orthogonalization         $3NM^2 - M^3$ 
Modified Gram-Schmidt            $2NM^2$ 
Golub-Reinsch SVD                $4NM^2 + 8M^3$ 
R-SVD                            $2NM^2 + 11M^3$ 


Normal equations versus QR decomposition. The squaring of X to form the time- 
average correlation matrix $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ results in a loss of information and should be avoided. 
Since $\|\mathbf{X}^{-1}\| = 1/\sigma_{\min}$, the condition number of X is 
$$\kappa(\mathbf{X}) = \|\mathbf{X}\|\,\|\mathbf{X}^{-1}\| = \frac{\sigma_{\max}}{\sigma_{\min}} \qquad (8.7.23)$$
which is analogous to the eigenvalue ratio for square Hermitian matrices. Hence, 
$$\kappa(\mathbf{X}^H\mathbf{X}) = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{\sigma_{\max}^2}{\sigma_{\min}^2} = \kappa^2(\mathbf{X}) \qquad (8.7.24)$$
which shows that squaring a matrix can only worsen its condition. 
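
The squaring of the condition number is easy to observe numerically. The sketch below assumes a small, nearly rank-deficient matrix made up for illustration; cond is MATLAB's built-in 2-norm condition number, which for a rectangular matrix is the ratio of the largest to the smallest singular value.

```matlab
% Condition number of X versus that of the correlation matrix X'*X.
% The matrix is an illustrative, nearly collinear example (assumption).
d  = 1e-4;
X  = [1 1; 1 1+d; 1 1-d];    % columns differ only by terms of size d
kX = cond(X);                % sigma_max / sigma_min
kR = cond(X'*X);             % lambda_max / lambda_min

fprintf('cond(X)    = %.4g\n', kX);
fprintf('cond(X''X)  = %.4g\n', kR);
fprintf('cond(X)^2  = %.4g\n', kX^2);
```

The last two printed values should agree to several digits, which is the numerical counterpart of (8.7.24).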
The study of the sensitivity of the LS problem is complicated. However, the following 
conclusions (Golub and Van Loan 1996; Van Loan 1997) can be drawn: 

1. The sensitivity of the LS solution is roughly proportional to the quantity $\kappa(\mathbf{X}) + \sqrt{E_{ls}}\,\kappa^2(\mathbf{X})$. 
Hence, any method produces inaccurate results when applied to ill-conditioned problems 
with large $E_{ls}$. 

2. The method of normal equations produces a solution $\mathbf{c}_{ls}$ whose relative error is approx- 
imately $\mathrm{eps}\cdot\kappa^2(\mathbf{X})$, where eps is the machine precision. 

3. The QR method (Householder, Givens, MGS) produces a solution $\mathbf{c}_{ls}$ whose relative 
error is approximately $\mathrm{eps}\cdot[\kappa(\mathbf{X}) + \sqrt{E_{ls}}\,\kappa^2(\mathbf{X})]$. 

In general, QR methods are more accurate than the normal equations approach and can be 
used for a wider class of data matrices, even though the latter is about twice as fast. 
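
The difference in accuracy can be observed on a small ill-conditioned problem with a known solution. This is a sketch under stated assumptions: the data matrix, the true coefficient vector, and the zero-residual setup are all chosen for illustration, and the observed errors depend on the platform's rounding behavior.

```matlab
% Normal equations (Cholesky) versus QR on an ill-conditioned LS problem.
% X and ctrue are illustrative values (assumption); the residual is zero.
d     = 1e-6;
X     = [1 1; 1 1+d; 1 1-d];      % cond(X) is on the order of 1e6
ctrue = [1; 1];
y     = X*ctrue;

% Power-domain solution: form X'*X and X'*y, then use Cholesky.
Rhat = X'*X;
dhat = X'*y;
L    = chol(Rhat, 'lower');
cne  = L' \ (L \ dhat);

% Amplitude-domain solution: economy-size QR applied directly to X.
[Q, R] = qr(X, 0);
cqr    = R \ (Q'*y);

errNE = norm(cne - ctrue)/norm(ctrue);
errQR = norm(cqr - ctrue)/norm(ctrue);
fprintf('relative error, normal equations: %g\n', errNE);
fprintf('relative error, QR:               %g\n', errQR);
```

With a condition number near 1e6, the rule of thumb above suggests a relative error of roughly eps times 1e12 for the normal equations and eps times 1e6 for QR.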

In many practical applications, we need to update the Cholesky or QR decomposition 
after the original data matrix has been modified by the addition or deletion of a row or column 
(rank 1 modifications). Techniques for the efficient computation of these decompositions 
by updating the existing ones can be found in Golub and Van Loan (1996) and Gill et al. 
(1974). 


8.7.3 Rank-Deficient LS Problems 


In theory, it is relatively easy to determine the rank of a matrix or that a matrix is rank- 
deficient. However, both tasks become complicated in practice when the elements of the 
matrix are specified with inadequate accuracy or the matrix is near singular. The SVD 
provides the means of determining how close a matrix is to being rank-deficient, which 
in turn leads to the concept of numerical rank. To this end, suppose that the elements 
of matrix X are known with an accuracy of order ε, and its computed singular values 
$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_M$ are such that 
$$\sigma_{r+1}^2 + \sigma_{r+2}^2 + \cdots + \sigma_M^2 < \epsilon^2 \qquad (8.7.25)$$
Then if we set $\boldsymbol{\Sigma}_r \triangleq \mathrm{diag}\{\sigma_1, \ldots, \sigma_r, 0, \ldots, 0\}$ and 
$$\mathbf{X}_r \triangleq \mathbf{U}\begin{bmatrix} \boldsymbol{\Sigma}_r \\ \mathbf{0} \end{bmatrix}\mathbf{V}^H \qquad (8.7.26)$$
we have 
$$\|\mathbf{X} - \mathbf{X}_r\|_F = \sqrt{\sigma_{r+1}^2 + \sigma_{r+2}^2 + \cdots + \sigma_M^2} < \epsilon \qquad (8.7.27)$$
and matrix X is said to be near a matrix of rank r or X has numerical rank r. It can be 
shown that $\mathbf{X}_r$ is the matrix of rank r that is nearest to X in the Frobenius norm sense (Leon 
1990; Stewart 1973). This result has important applications in signal modeling and data 
compression. 
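
The following sketch illustrates numerical rank and the truncated-SVD approximation. The matrix (an exactly rank-2 matrix plus a small fixed perturbation) and the accuracy level epsl are assumptions chosen for illustration.

```matlab
% Numerical rank and nearest rank-r matrix via the truncated SVD.
% X0 has exact rank 2; the perturbation and epsl are illustrative (assumption).
X0   = [1 2 3; 4 5 6; 7 8 9; 10 11 12];
X    = X0 + 1e-6*[1 -1 0; 0 1 -1; -1 0 1; 1 1 -2];
epsl = 1e-4;                         % assumed accuracy of the data

[U, S, V] = svd(X, 0);               % economy-size SVD
s = diag(S);

% Drop trailing singular values while their root-sum-square stays below epsl.
r = numel(s);
while r > 0 && sqrt(sum(s(r:end).^2)) < epsl
    r = r - 1;
end

Xr   = U(:, 1:r) * S(1:r, 1:r) * V(:, 1:r)';   % rank-r approximation
errF = norm(X - Xr, 'fro');                    % approximation error
tail = sqrt(sum(s(r+1:end).^2));               % discarded singular values

fprintf('numerical rank = %d, error = %g, tail = %g\n', r, errF, tail);
```

The Frobenius error of the approximation should match the root-sum-square of the discarded singular values.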

Computing the LS solution for rank-deficient data matrices requires extra care. When a 
singular value is equal to a very small number, its reciprocal, which is a singular value of the 
pseudoinverse XT, is a very large number. As a result, the LS solution deviates substantially 
from the “true” solution. 

One way to handle this problem is to replace each singular value below a certain 
cutoff value (thresholding) with zero. A typical threshold is a fraction of $\sigma_1$ determined by 
either the machine precision available or the accuracy of the elements in the data matrix 
(measurement accuracy). For example, if the data matrix is accurate to six decimal places, 
we set the threshold at $10^{-6}\sigma_1$ (Golub and Van Loan 1996). 

Another way is to replace the LS criterion (8.7.14) by 
$$E(\mathbf{c}, \mu) = \|\mathbf{y} - \mathbf{X}\mathbf{c}\|^2 + \mu\|\mathbf{c}\|^2 \qquad (8.7.28)$$
where the constant μ > 0 reflects the importance of the norm of the solution vector. 
The term $\|\mathbf{c}\|^2$ acts as a stabilizer, that is, it prevents the solution $\mathbf{c}_\mu$ from becoming too large 
(regularization). Indeed, using the method of Lagrange multipliers, we can show that 
$$\mathbf{c}_\mu = \sum_{i=1}^{r}\frac{\sigma_i}{\sigma_i^2 + \mu}(\mathbf{u}_i^H\mathbf{y})\mathbf{v}_i \qquad (8.7.29)$$
which is known as the regularized solution. We note that $\mathbf{c}_\mu = \mathbf{c}_{ls}$ when μ = 0. However, 
when μ > 0, as $\sigma_i \to 0$ the term $\sigma_i/(\sigma_i^2 + \mu)$ in (8.7.29) tends to zero while the term 
$1/\sigma_i$ in (8.7.15) tends to infinity. Furthermore, it can be shown that $\|\mathbf{c}_{ls}\| \le \|\mathbf{y}\|/\sigma_r$ 
and $\|\mathbf{c}_\mu\| \le \|\mathbf{y}\|/\sqrt{\mu}$ (Hager 1988). 
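
The regularized solution can be computed directly from the SVD. In the sketch below the data, the regularization constant (called mu in the code), and the cross-check against the regularized normal equations are assumptions made for illustration.

```matlab
% Regularized LS solution via the SVD, cross-checked against
% (X'*X + mu*I) \ (X'*y). Data and mu are illustrative (assumption).
X  = [1 1; 1 1.001; 1 0.999];        % nearly collinear columns
y  = [2; 2.1; 1.9];
mu = 1e-3;                           % regularization constant

[U, S, V] = svd(X, 0);               % economy-size SVD
s  = diag(S);
yp = U' * y;

cls  = V * (yp ./ s);                      % unregularized LS solution
cmu  = V * ((s ./ (s.^2 + mu)) .* yp);     % regularized solution
cref = (X'*X + mu*eye(2)) \ (X'*y);        % same solution, power domain

fprintf('norm(cls) = %g, norm(cmu) = %g\n', norm(cls), norm(cmu));
```

The small singular value contributes a large component to cls, whereas in cmu that component is damped by the factor s./(s.^2 + mu).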

Since the minimum-norm LS solution requires only the first r columns of U, where r 
is the numerical rank of X, we can use the thin SVD. If N > M, the computation of either 
$\mathbf{U}_r$ or U is expensive. However, in practical SVD algorithms, U is computed as the product 
of many reflections and rotations. Hence, we can compute $\mathbf{y}' = \mathbf{U}^H\mathbf{y}$ by updating y at each 
step i with each orthogonal transformation, that is, $\mathbf{U}_i^H\mathbf{y} \to \mathbf{y}$. 


8.8 SUMMARY 


In this chapter we discussed the theory, implementation, and application of linear estimators 
(combiners, filters, and predictors) that are optimum according to the LSE criterion of 
performance. The fundamental differences between linear MMSE and LSE estimators are 
as follows: 


• MMSE estimators are designed using ensemble average second-order moments R and 
d; they can be designed prior to operation, and during their normal operation they need 
only the input signals. 

• LSE estimators are designed using time-average estimates $\hat{\mathbf{R}}$ and $\hat{\mathbf{d}}$ of the second-order 
moments or data matrix X and the desired response vector y. For this reason LSE estima- 
tors are sometimes said to be data-adaptive. The design and operation of LSE estimators 
are coupled and are usually accomplished by using either of the following approaches: 

  - Collect a block of training data $\mathbf{X}_{tr}$ and $\mathbf{y}_{tr}$ and use them to design an LSE estimator; 
    use it to process subsequent blocks. Clearly, this approach is meaningful if all 
    blocks have statistically similar characteristics. 

  - For each collected block of data X and y, compute the LSE filter $\mathbf{c}_{ls}$ or the LSE 
    estimate $\hat{\mathbf{y}}$ (whatever is needed). 


There are various numerical algorithms designed to compute LSE estimators and esti- 
mates. For well-behaved data and sufficient numerical precision, all these methods produce 


the same results and therefore provide the same LSE performance, that is, the same total 
squared error. 

However, when ill-conditioned data, finite precision, or computational complexity is a 
concern, the choice of the LS computational algorithm is very important. 

We saw that there are two major families of numerical algorithms for dealing with LS 
problems: 


• Power-domain techniques solve LS estimation problems using the time-average mo- 
ments $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ and $\hat{\mathbf{d}} = \mathbf{X}^H\mathbf{y}$. The most widely used methods are the $\mathbf{LDL}^H$ and 
Cholesky decompositions. 

• Amplitude-domain techniques operate directly on data matrix X and the desired re- 
sponse vector. In general, they require more computations and have better numerical 
properties than power-domain methods. This group includes the QR orthogonal- 
ization methods (Householder, Givens, and modified Gram-Schmidt) and the SVD 
method. 


The QR decomposition methods apply a unitary transformation to the data matrix to 
reduce it to an upper triangular one, whereas the GS methods apply an upper triangular 
matrix transformation to orthogonalize the columns of the data matrix. 

In conclusion, we emphasize that there are various ways to compute the coefficients of 
an optimum estimator and the value of the optimum estimate. We stress that the performance 
of any optimum estimator, as measured by the MMSE or LSE, does not depend on the 
particular implementation as long as we have sufficient numerical precision. Therefore, if 
we want to investigate how well an optimum estimator performs in a certain application, we 
can use any implementation, as long as computational complexity is not a consideration. 


PROBLEMS 


8.1 By differentiating (8.2.8) with respect to the vector c, show that the LSE estimator $\mathbf{c}_{ls}$ is given 
by the solution of the normal equations (8.2.12). 


8.2 Let the weighted LSE be given by $E_w = \mathbf{e}^H\mathbf{W}\mathbf{e}$, where W is a Hermitian positive definite 
matrix. 


(a) By minimizing $E_w$ with respect to the vector c, show that the weighted LSE estimator is 
given by (8.2.35). 

(b) Using the $\mathbf{LDL}^H$ decomposition $\mathbf{W} = \mathbf{LDL}^H$, show that the weighted LS criterion corre- 
sponds to prefiltering the error or the data. 


8.3. Using direct substitution of (8.4.4) into (8.4.5), show that the LS estimator of? and the associated 
LS error EY are determined by (8.4.5). 


8.4 Consider a linear system described by the difference equation y(n) = 0.9y(n − 1) + 0.1x(n − 1) + v(n), 
where x(n) is the input signal, y(n) is the output signal, and v(n) is an output 
disturbance. Suppose that we have collected N = 1000 samples of input-output data and that we 
wish to estimate the system coefficients, using the LS criterion with no windowing. Determine 
the coefficients of the model y(n) = ay(n − 1) + dx(n − 1) and their estimated covariance 
matrix $\hat{\sigma}^2\mathbf{R}^{-1}$ when 
(a) x(n) ~ WGN(0, 1) and v(n) ~ WGN(0, 1) and 
(b) x(n) ~ WGN(0, 1) and v(n) = 0.8v(n − 1) + w(n) is an AR(1) process with w(n) ~ 
WGN(0, 1). 
Comment upon the quality of the obtained estimates by comparing the matrices 
$\hat{\sigma}^2\mathbf{R}^{-1}$ obtained in each case. 


8.5 Use Lagrange multipliers to show that Equation (8.4.13) provides the minimum of (8.4.8) under 
the constraint (8.4.9). 
8.6 If full windowing is used in LS, then the autocorrelation matrix is Toeplitz. Using this fact, show 
that in the combined FBLP the predictor is given by 
$$\mathbf{a}^{fb} = \tfrac{1}{2}(\mathbf{a} + \mathbf{J}\mathbf{b}^*)$$


8.7 Consider the noncausal “middle” sample linear signal estimator specified by (8.4.1) with M = 
2L and i = L. 


(a) Show that if we apply full windowing to the data matrix, the resulting signal estimator is 
conjugate symmetric, that is, $\mathbf{c}^{(L)} = \mathbf{J}\mathbf{c}^{(L)*}$. This property does not hold for any other 
windowing method. 

(b) Derive the normal equations for the signal estimator that minimizes the total squared error 
$E^{(L)} = \|\mathbf{e}^{(L)}\|^2$ under the constraint $\mathbf{c}^{(L)} = \mathbf{J}\mathbf{c}^{(L)*}$. 

(c) Show that if we enforce the normal equation matrix to be centro-Hermitian, that is, we use 
the normal equations 


0 
(KIX 4 JX X* Ye = | von 
0 


then the resulting signal smoother is conjugate symmetric. 
(d) Illustrate parts (a) to (c), using the data matrix 


al 
Il 
w 
YN OF Ne 
w 


and check which smoother provides the smallest total squared error. Try to justify the 
obtained answer. 


8.8 A useful impulse response for some geophysical signal processing applications is the Mexican 
hat wavelet 
$$g(t) = (1 - t^2)e^{-t^2/2}$$
which is the second derivative of a Gaussian pulse. 


(a) Plot the wavelet g(t) and the magnitude and phase of its Fourier transform. 

(b) By examining the spectrum of the wavelet, determine a reasonable sampling frequency $F_s$. 

(c) Design an optimum LS inverse FIR filter for the discrete-time wavelet g(nT), where T = 
$1/F_s$. Determine a reasonable value for M by plotting the LSE $E_M$ as a function of order 
M. Investigate whether we can improve the inverse filter by introducing some delay $n_0$. 
Determine the best value of $n_0$ and plot the impulse response of the resulting filter and the 
combined impulse response $g(n) * h(n - n_0)$, which should resemble an impulse. 

(d) Repeat part (c) by increasing the sampling frequency by a factor of 2 and comparing with 
the results obtained in part (c). 


8.9 (a) Prove Equation (8.5.4) regarding the $\mathbf{LDL}^H$ decomposition of the augmented matrix $\bar{\mathbf{R}}$. 
(b) Solve the LS estimation problem in Example 8.5.1, using the $\mathbf{LDL}^H$ decomposition of $\bar{\mathbf{R}}$ 
and the partitionings in (8.5.4). 


8.10 Prove the order-recursive algorithm described by the relations given in (8.5.12). Demonstrate 
the validity of this approach, using the data in Example 8.5.1. 


8.11 In this problem, we wish to show that the statistical interpretations of innovation and partial 
correlation for $w_m(n)$ and $k_{m+1}$ in (8.5.12) hold in a deterministic LSE sense. To this end, 
suppose that the “partial correlation” between y and X,,, +, is defined using the residual records 
@n = Y — Xmem and @> = X41 + Xmbm. where bm is the LSE BLP. Show that ky4) = 
Bm+i/Em41, Where Bin +1 = eed and €n41 = eH eb Demonstrate the validity of these 
formulas using the data in Example 8.5.1. 


8.12 Show that the Cholesky decomposition of a Hermitian positive definite matrix R can be com- 
puted by using the following algorithm 

    for j = 1 to M 
        $l_{jj} = \big(r_{jj} - \sum_{k=1}^{j-1}|l_{jk}|^2\big)^{1/2}$ 
        for i = j + 1 to M 
            $l_{ij} = \big(r_{ij} - \sum_{k=1}^{j-1} l_{ik}l_{jk}^*\big)/l_{jj}$ 
        end i 
    end j 

and write a MATLAB function for its implementation. Test your code using the built-in MATLAB 
function chol. 


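8.13 Compute the $\mathbf{LDL}^H$ and Cholesky decompositions of the following matrices: 
$$\mathbf{X}_1 = \begin{bmatrix} 9 & 3 & -6 \\ 3 & 4 & 1 \\ -6 & 1 & 9 \end{bmatrix} \qquad \text{and} \qquad \mathbf{X}_2 = \begin{bmatrix} 6 & 4 & -2 \\ 4 & 5 & 3 \\ -2 & 3 & 6 \end{bmatrix}$$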


8.14 Solve the LS problem in Example 8.6.1, 


(a) using the QR decomposition of the augmented data matrix $\bar{\mathbf{X}} = [\mathbf{X}\ \mathbf{y}]$ and 
(b) using the Cholesky decomposition of the matrix $\bar{\mathbf{R}} = \bar{\mathbf{X}}^H\bar{\mathbf{X}}$. 
Note: Use MATLAB built-in functions for the QR and Cholesky decompositions. 


8.15 (a) Show that a unit vector w is an eigenvector of the matrix $\mathbf{H} = \mathbf{I} - 2\mathbf{w}\mathbf{w}^H$. What is the 
corresponding eigenvalue? 

(b) If a vector z is orthogonal to w, show that z is an eigenvector of H. What is the corresponding 
eigenvalue? 


8.16 Solve the LS problem 


using the Householder transformation. 

8.17 Solve Problem 8.16 by using the Givens transformation. 


8.18 Compute the QR decomposition of the data matrix 


ae | 
2.0 A 
x = 

20 -1 
li 2 1 


using the GS and MGS methods, and compare the obtained results. 


8.19 Solve the following LS problem 


lo ele 


by computing the QR decomposition using the GS algorithm. 
8.20 Show that the computational organization of the MGS algorithm shown in Table 8.4 can be used 
to compute the GS algorithm if we replace the step $r_{im} = \mathbf{q}_i^H\mathbf{x}_m$ by $r_{im} = \mathbf{q}_i^H\mathbf{q}_m$. 


Lael 
8.21 Compute the SVD of X = | I 1 by computing the eigenvalues and eigenvectors of $\mathbf{X}^H\mathbf{X}$ and 
$\mathbf{X}\mathbf{X}^H$. Check with the results obtained using the svd function. 


8.22 Repeat Problem 8.21 for 


X= es d 
(a) X= ug an 


(b) X Oo 1 1 

fl 1 of 
8.23 Write a MATLAB program to produce the plots in Figure 8.14, using the matrix X = [ 
Hint: Use a parametric description of the circle in polar coordinates. 


4+ él: 


é & 
8.24 For the matrix X = ki i al determine $\mathbf{X}^{+}$ and verify that X and $\mathbf{X}^{+}$ satisfy the four 
Moore-Penrose conditions (8.7.22). 


8.25 Prove the four Moore-Penrose conditions in (8.7.22) and explain why $\mathbf{X}\mathbf{X}^{+}$ and $\mathbf{X}^{+}\mathbf{X}$ are 
orthogonal projections onto the range space of X and $\mathbf{X}^H$, respectively. 


8.26 In this problem we examine in greater detail the radio-frequency interference cancelation ex- 
periment discussed in Section 8.4.3. We first explain the generation of the various signals and 
then proceed with the design and evaluation of the LS interference canceler. 


(a) The useful signal is a pointlike target defined by 


I a agit) 
S(t) = = 
dt (<a —arr) dt 
where a = 2.3, t = 0.4, and tr = 2. Given that F; = 2 GHz, determine s(n) by 
computing the samples g(n) = g(nT) in the interval —2 < nT < 6ns and then computing 
the first difference s(n) = g(n) — g(n — 1). Plot the signal s(n) and its Fourier transform 
(magnitude and phase), and check whether the pointlike and wideband assumptions are 
justified. 
(b) Generate N = 4096 samples of the narrowband interference using the formula 
$$z(n) = \sum_{i=1}^{L} A_i\sin(\omega_i n + \phi_i)$$
and the following information: 

    Fs=2; % All frequencies are measured in GHz. 
    F=0.1*[0.6 1 1.8 2.1 3 4.8 5.2 5.7 6.1 6.4 6.7 7 7.8 9.3]'; 
    L=length(F); 
    om=2*pi*F/Fs; 
    A=[0.5 1 1 0.5 0.1 0.3 0.5 1 1 1 0.5 0.3 1.5 0.5]'; 
    rand('seed',1954); 
    phi=2*pi*rand(L,1); 

(c) Compute and plot the periodogram of z(n) to check the correctness of your code. 

(d) Generate N samples of white Gaussian noise v(n) ~ WGN(0, 0.1) and create the ob- 
served signal $x(n) = 5s(n - n_0) + z(n) + v(n)$, where $n_0 = 1000$. Compute and plot the 
periodogram of x(n). 

(e) Design a one-step ahead (D = 1) linear predictor with M = 100 coefficients using the 
FBLP method with no windowing. Then use the obtained FBLP to clean the corrupted 
signal x(n) as shown in Figure 8.7. To evaluate the performance of the canceler, generate 
the plots shown in Figures 8.8 and 8.9. 


8.27 Careful inspection of Figure 8.9 indicates that the D-step prediction error filter, that is, the 
system with input x(n) and output $e^f(n)$, acts as a whitening filter. In this problem, we try 
to solve Problem 8.26 by designing a practical whitening filter using a power spectral density 
(PSD) estimate of the corrupted signal x(n). 


(a) Estimate the PSD $\hat{R}_x(e^{j\omega_k})$, $\omega_k = 2\pi k/N_{FFT}$, of the signal x(n), using the method of 
averaged periodograms. Use a segment length of L = 256 samples, 50 percent overlap, and 
$N_{FFT} = 512$. 

(b) Since the PSD does not provide any phase information, we shall design a whitening FIR 
filter with linear phase by 
$$\tilde{H}(k) = \frac{1}{\sqrt{\hat{R}_x(e^{j\omega_k})}}\,e^{-j\frac{2\pi}{N_{FFT}}\frac{N_{FFT}-1}{2}k}$$
where $\tilde{H}(k)$ is the DFT of the impulse response of the filter, that is, 
$$\tilde{H}(k) = \sum_{n=0}^{N_{FFT}-1} h(n)\,e^{-j\frac{2\pi}{N_{FFT}}nk}$$
with $0 \le k \le N_{FFT} - 1$. 

(c) Use the obtained whitening filter to clean the corrupted signal x(n), and compare its per- 
formance with the FBLP canceler by generating plots similar to those shown in Figures 8.8 
and 8.9. 

(d) Repeat part (c) with L = 128, $N_{FFT} = 512$ and L = 512, $N_{FFT} = 1024$ and check whether 
spectral resolution has any effect upon the performance. Note: Information about the design 
and implementation of FIR filters using the DFT can be found in Proakis and Manolakis 
(1996). 


8.28 Repeat Problem 8.27, using the multitaper method of PSD estimation. 


8.29 In this problem we develop an RFI canceler using a symmetric linear smoother with guard 
samples defined by 
$$e(n) = x(n) - \hat{x}(n) \triangleq x(n) + \sum_{k=D}^{M} c_k[x(n-k) + x(n+k)]$$
where 1 ≤ D < M prevents the use of the D adjacent samples to the estimation of x(n). 


(a) Following the approach used in Section 8.4.3, demonstrate whether such a canceler can be 
used to mitigate RFI and under what conditions. 

(b) If there is theoretical justification for such a canceler, estimate its coefficients, using the 
method of LS with no windowing for M = 50 and D = 1 for the situation described in 
Problem 8.26. 

(c) Use the obtained filter to clean the corrupted signal x(n), and compare its performance with 
the FBLP canceler by generating plots similar to those shown in Figures 8.8 and 8.9. 

(d) Repeat part (c) for D = 2. 


8.30 In Example 6.7.1 we studied the design and performance of an optimum FIR inverse system. In 
this problem, we design and analyze the performance of a similar FIR LS inverse filter, using 
training input-output data. 


(a) First, we generate N = 100 observations of the input signal y(n) and the noisy output signal 
x(n). We assume that y(n) ~ WGN(0, 1) and v(n) ~ WGN(O, 0.1). To avoid transient 
effects, we generate 200 samples and retain the last 100 samples to generate the required 
data records. 

(b) Design an LS inverse filter with M = 10 for 0 < D < 10, using no windowing, and choose 
the best value of delay D. 

(c) Repeat part (b) using full windowing. 

(d) Compare the LS filters obtained in parts (b) and (c) with the optimum filter designed in 
Example 6.7.1. What are your conclusions? 
(a) Generate N = 1000 samples of input-desired response data $\{x(n), a(n)\}_0^{N-1}$ and use them 
to estimate the correlation matrix $\hat{\mathbf{R}}$ and the cross-correlation vector $\hat{\mathbf{d}}$ between x(n) and 
y(n − D). Use D = 7, M = 11, and W = 2.9. Solve the normal equations to determine 
the LS FIR equalizer and the corresponding LSE. 

(b) Repeat part (a) 500 times; by changing the seed of the random number generators, compute 
the average (over the realizations) coefficient vector and average LSE, and compare with 
the optimum MSE equalizer obtained in Example 6.8.1. What are your conclusions? 

(c) Repeat parts (a) and (b) by setting W = 3.1. 


CHAPTER 9 


Signal Modeling and Parametric 
Spectral Estimation 


This chapter is a transition from theory to practice. It focuses on the selection of an ap- 
propriate model for a given set of data, the estimation of the model parameters, and how 
well the model actually “fits the data.” Although the development of parameter estimation 
techniques requires a strong theoretical background, the selection of a good model and its 
subsequent evaluation require the user to have sufficient practical experience and a familiarity 
with the intended application. We provide complete, detailed algorithms for fitting 
pole-zero models to data using least-squares techniques. The estimation of all-pole model 
parameters involves the solution of a linear system of equations, whereas pole-zero mod- 
eling requires nonlinear least-squares optimization. The chapter is roughly organized into 
two separate but related parts. 

In the first part, we begin in Section 9.1 by explaining the steps that are required 
in the model-building process. Then, in Section 9.2, we introduce various least-squares 
algorithms for the estimation of parameters of direct and lattice all-pole models, provide 
different interpretations, and discuss some order selection criteria. For pole-zero models we 
provide, in Section 9.3, a nonlinear optimization algorithm that estimates the parameters of 
the model by minimizing the least-squares criterion. We conclude this part with Section 9.4 
in which we discuss the applications of pole-zero models to spectral estimation and speech 
processing. 

In the second part, we begin with the method of minimum-variance spectral estimation 
(Capon’s method). Then we describe frequency estimation methods based on the harmonic 
model: the Pisarenko harmonic decomposition and the MUSIC, minimum-norm, and ES- 
PRIT algorithms. These methods are suitable for applications in which the signals of interest 
can be represented by complex exponential or harmonic models. Signals consisting of com- 
plex exponentials are found in a variety of applications including as formant frequencies 
in speech processing, moving targets in radar, and spatially propagating signals in array 
processing. 


9.1 THE MODELING PROCESS: THEORY AND PRACTICE 
In this section, we discuss the modeling of real-world signals using parametric pole-zero 
(PZ) signal models, whose theoretical properties were discussed in Chapter 4. We focus 
on PZ(P, Q) models with white input sequences, which are also known as ARMA(P, Q) 
random signal models. These models are defined by the linear constant-coefficient difference 
equation 
$$x(n) = -\sum_{k=1}^{P} a_k x(n-k) + w(n) + \sum_{k=1}^{Q} d_k w(n-k) \qquad (9.1.1)$$
where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$ with $\sigma_w^2 < \infty$. The power spectral density (PSD) of the output 
signal is 
$$R(e^{j\omega}) = \sigma_w^2\,\frac{\left|1 + \sum_{k=1}^{Q} d_k e^{-j\omega k}\right|^2}{\left|1 + \sum_{k=1}^{P} a_k e^{-j\omega k}\right|^2} = \sigma_w^2\,\frac{|D(e^{j\omega})|^2}{|A(e^{j\omega})|^2} \qquad (9.1.2)$$
which is a rational function completely specified by the parameters $\{a_1, a_2, \ldots, a_P\}$, 
$\{d_1, \ldots, d_Q\}$, and $\sigma_w^2$. We stress that since these models are linear, time-invariant (LTI), 
the resulting process x(n) is stationary, provided that the corresponding systems are 
BIBO stable. 
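
As an illustration of the model and its PSD, the sketch below evaluates the rational spectrum on a frequency grid and generates one realization with MATLAB's filter function. The ARMA(2, 1) coefficients and the noise variance are assumed values chosen only so that the poles lie inside the unit circle.

```matlab
% PSD of an ARMA(2,1) model and one simulated realization.
% Coefficients and noise variance are illustrative values (assumption).
a   = [1 -0.9 0.5];          % A(z) = 1 + a1 z^-1 + a2 z^-2
d   = [1 0.4];               % D(z) = 1 + d1 z^-1
sw2 = 2;                     % variance of the white input w(n)

% Evaluate D and A at K equispaced frequencies on [0, 2*pi).
K  = 512;
w  = 2*pi*(0:K-1)'/K;
E  = exp(-1j*w*(0:2));       % columns hold e^{-j w k}, k = 0, 1, 2
Dw = E(:, 1:numel(d)) * d(:);
Aw = E(:, 1:numel(a)) * a(:);
R  = sw2 * abs(Dw).^2 ./ abs(Aw).^2;   % rational PSD

% Cross-check: the same samples obtained with zero-padded FFTs.
R2 = sw2 * abs(fft(d(:), K)).^2 ./ abs(fft(a(:), K)).^2;

% One realization: filter white Gaussian noise through D(z)/A(z).
x = filter(d, a, sqrt(sw2)*randn(1000, 1));

fprintf('max pole radius = %.4f\n', max(abs(roots(a))));
```

The sign convention matches the difference equation: filter(d, a, w) implements x(n) = −a1 x(n−1) − a2 x(n−2) + w(n) + d1 w(n−1).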

The essence of signal modeling and of the resulting parametric spectrum estimation 
is the following: Given finite-length data $\{x(n)\}_{n=0}^{N-1}$, which can be regarded as a sample 
sequence of the signal under consideration, we want to estimate signal model parame- 
ters $\{\hat{a}_k\}_1^P$, $\{\hat{d}_k\}_1^Q$, and $\hat{\sigma}_w^2$ to satisfy a prescribed criterion. Furthermore, if the parameter 
estimates are sufficiently accurate, then the following formula 
$$\hat{R}(e^{j\omega}) = \hat{\sigma}_w^2\,\frac{\left|1 + \sum_{k=1}^{Q}\hat{d}_k e^{-j\omega k}\right|^2}{\left|1 + \sum_{k=1}^{P}\hat{a}_k e^{-j\omega k}\right|^2} = \hat{\sigma}_w^2\,\frac{|\hat{D}(e^{j\omega})|^2}{|\hat{A}(e^{j\omega})|^2} \qquad (9.1.3)$$
should provide a reasonable estimate of the signal PSD. A similar argument applies to 
harmonic signal models and harmonic spectrum estimation in which the model parameters 
are the amplitudes and frequencies of complex exponentials (see Section 3.3.6). 

The development of such models involves the steps shown in Figure 9.1. In this chapter, 
we assume that we have removed trends, seasonal variations, and other nonstationarities 
from the data. We further assume that unit poles have been removed from the data by using 
the differencing approach discussed in Box et al. (1994). 


Model selection 


In this step, we basically select the structure of the model (direct or lattice), and we make 
a preliminary decision on the orders P and Q of the model. The most important aid to model 
selection is the insight and understanding of the signal and the physical mechanism that 
generates it. Hence, in some applications (e.g., speech processing) physical considerations 
point to the type and order of the model; when we lack a priori information or we have 
insufficient knowledge of the mechanism generating the signal, we resort to data analysis 
methods. 

In general, to select a candidate model, we estimate the autocorrelation, partial au- 
tocorrelation, and power spectrum from the available data, and we compare them to the 
corresponding quantities obtained from the theoretical models (see Table 4.1). This prelim- 
inary data analysis provides sufficient information to choose a PZ model and some initial 
estimate for P and Q to start a model building process. Several order selection criteria 
have been developed that penalize both model misfit and a large number of parameters. 
Although theoretically interesting and appealing, these criteria are of limited value when 
we deal with actual signals. 


FIGURE 9.1 
Steps in the signal model building process. Stage 1, model selection: choose model structure 
and order. Stage 2, model estimation: estimate model parameters. Stage 3, model validation: 
check the candidate model for performance. If the model is satisfactory, use it for your 
application; otherwise, repeat the process. 

The model structure influences (1) the complexity of the algorithm that estimates the 
model parameters and (2) the shape of the criterion function (quadratic or nonquadratic). 
Therefore, the structure (direct or lattice) is not critical to the performance of the model, 
and its choice is not as crucial as the choice of the order of the model. 


Model estimation 


In this step, also known as model fitting, we use the available data $\{x(n)\}_0^{N-1}$ to estimate 
the parameters of the selected model, using optimization of some criterion. Although 
there are several criteria (e.g., maximum likelihood, spectral matching) that can be used 
to measure the performance or quality of a PZ model, we concentrate on the least-squares 
(LS) error criterion. As we shall see, the estimation of all-pole (AP) models leads to linear 
optimization problems whereas the estimation of all-zero (AZ) and PZ models requires 
the solution of nonlinear optimization problems. Parameter estimation for PZ models using 
other criteria can be found in Kay (1988), Box et al. (1994), Porat (1994), and Ljung (1987). 


Model validation 


Here we investigate how well the obtained model captures the key features of the 
data. We then take corrective actions, if necessary, by modifying the order of the model, 
and repeat the process until we get an acceptable model. The goal of the model validation 
process is to find out whether the model 


• Agrees sufficiently with the observed data
• Describes the “true” signal generation system
• Solves the problem that initiated the design process


Of course, the ultimate test is whether the model satisfies the requirements of the intended 
application, that is, the objective and subjective criteria that specify the performance of the 
model, computational complexity, cost, etc. In this discussion, we concentrate on how well 
the model fits the observed data in an LS error statistical sense. 

The existence of any structure in the residual or prediction error signal indicates a misfit 
between the model and the data. Hence, a key validation technique is to check whether the 
residual process, which is generated by the inverse of the fitted model, is a realization of 
white noise. This can be checked by using, among others, the following statistical techniques 
(Brockwell and Davis 1991; Bendat and Piersol 1986):

Autocorrelation test. It can be shown (Kendall and Stuart 1983) that when N is sufficiently
large, the distribution of the estimated autocorrelation coefficients $\hat{\rho}(l) = \hat{r}(l)/\hat{r}(0)$
is approximately Gaussian with zero mean and variance of 1/N. The approximate 95 percent
confidence limits are $\pm 1.96/\sqrt{N}$. Any estimated values of $\hat{\rho}(l)$ that fall outside these
limits are “significantly” different from zero with 95 percent confidence. Values well beyond
these limits indicate nonwhiteness of the residual signal.
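As an illustration of the autocorrelation test, the following minimal Python sketch computes the coefficients $\hat{\rho}(l)$ and flags the lags that fall outside $\pm 1.96/\sqrt{N}$. The function names, the choice of the biased autocorrelation estimate, the mean removal, and the AR(1) comparison signal are assumptions made for this example only; they are not the book's autoc function.

```python
import math
import random

def autocorr_coeffs(x, max_lag):
    """Biased autocorrelation estimates normalized by the lag-0 value."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]              # remove the sample mean (assumption)
    r0 = sum(v * v for v in xc) / n
    return [sum(xc[i] * xc[i - l] for i in range(l, n)) / n / r0
            for l in range(max_lag + 1)]

def autocorr_whiteness_test(x, max_lag=20):
    """Return (rho, limit, outliers); outliers are lags 1..max_lag with
    |rho(l)| above the approximate 95 percent limit 1.96/sqrt(N)."""
    rho = autocorr_coeffs(x, max_lag)
    limit = 1.96 / math.sqrt(len(x))
    outliers = [l for l in range(1, max_lag + 1) if abs(rho[l]) > limit]
    return rho, limit, outliers

random.seed(0)
white = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Strongly correlated AR(1) sequence, used only for contrast.
colored = [0.0]
for w in white[1:]:
    colored.append(0.9 * colored[-1] + w)

_, limit, out_white = autocorr_whiteness_test(white)
_, _, out_colored = autocorr_whiteness_test(colored)
print(f"95 percent limit: {limit:.4f}")
print("lags outside the limits (white):  ", out_white)
print("lags outside the limits (colored):", out_colored)
```

Because the limits are 95 percent limits, an occasional lag outside them is expected even for white noise; it is values well beyond the limits, as for the correlated sequence, that indicate nonwhiteness.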


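Power spectrum density test. Given a set of data $\{x(n)\}_{0}^{N-1}$, the standardized cumulative
periodogram is defined by

$$\tilde{I}(k) = \begin{cases} 0 & k < 1 \\[4pt] \dfrac{\sum_{i=1}^{k} \hat{R}_x(e^{j2\pi i/N})}{\sum_{i=1}^{K} \hat{R}_x(e^{j2\pi i/N})} & 1 \le k \le K \\[4pt] 1 & k > K \end{cases}$$    (9.1.4)

where K is the integer part of N/2 and $\hat{R}_x(e^{j\omega})$ denotes the periodogram of the data. If the
process x(n) is white Gaussian noise (WGN), then the random variables $\tilde{I}(k)$, k = 1, 2, ..., K,
are independently and uniformly distributed in the interval (0, 1), and the plot of $\tilde{I}(k)$ should
be approximately linear with respect to k (Jenkins and Watts 1968). The hypothesis that x(n)
is white is rejected at level 0.05 if $\tilde{I}(k)$ exits the boundaries specified by

$$\tilde{I}^{(b)}(k) = \frac{k-1}{K-1} \pm 1.36\,(K-1)^{-1/2} \qquad 1 \le k \le K$$    (9.1.5)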
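The cumulative periodogram test of (9.1.4) and (9.1.5) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the periodogram is computed by a direct DFT (adequate for short records), and the periodogram scaling is an assumption that cancels in the ratio (9.1.4).

```python
import cmath
import math
import random

def periodogram(x):
    """Periodogram at frequencies 2*pi*i/N for i = 1..K, K = floor(N/2)."""
    n = len(x)
    K = n // 2
    out = []
    for i in range(1, K + 1):
        X = sum(x[t] * cmath.exp(-2j * math.pi * i * t / n) for t in range(n))
        out.append(abs(X) ** 2 / n)
    return out

def cumulative_periodogram_test(x):
    """Return (I, passed): I[k-1] is the standardized cumulative periodogram
    at k, and passed is True if I stays inside the 0.05-level boundaries."""
    p = periodogram(x)
    K = len(p)
    total = sum(p)
    I, acc = [], 0.0
    for v in p:
        acc += v
        I.append(acc / total)
    half_width = 1.36 / math.sqrt(K - 1)
    passed = all(abs(I[k - 1] - (k - 1) / (K - 1)) <= half_width
                 for k in range(1, K + 1))
    return I, passed

N = 128
tone = [math.cos(2 * math.pi * 8 * t / N) for t in range(N)]   # clearly nonwhite
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(N)]             # WGN sample

I_tone, tone_passed = cumulative_periodogram_test(tone)
I_noise, noise_passed = cumulative_periodogram_test(noise)
print("sinusoid passes the whiteness test:", tone_passed)
print("WGN sample passes the whiteness test:", noise_passed)
```

For the sinusoid, all the power sits in one frequency bin, so the cumulative periodogram jumps from nearly 0 to nearly 1 and leaves the boundaries; a WGN record is expected to stay inside them about 95 percent of the time.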


Partial autocorrelation test. This test is similar to the autocorrelation test. Given the
residual process x(n), it can be shown (Kendall and Stuart 1983) that when N is sufficiently
large, the partial autocorrelation sequence (PACS) values $\{k_l\}$ for lag $l$ [defined in (4.2.44)]
are approximately independent with distribution $\mathcal{N}(0, 1/N)$. This means that roughly
95 percent of the PACS values fall within the bounds $\pm 1.96/\sqrt{N}$. If we observe values
consistently well beyond this range for N sufficiently large, it may indicate nonwhiteness
of the signal.


EXAMPLE 9.1.1. To apply the above tests and interpret their results, we consider a WGN sequence
x(n). By using the randn function, 100 samples of x(n) with zero mean and unit variance were
generated. These samples are shown in Figure 9.2. From these samples, the autocorrelation
estimates up to lag 40, denoted by $\{\hat{r}(l)\}_{l=0}^{40}$, were computed using the autoc function, from
which the correlation coefficients $\hat{\rho}(l)$ were obtained. The first 10 coefficients are shown in
Figure 9.2 along with the appropriate confidence limits. As expected, the first coefficient at lag
0 is unity while the remaining coefficients are within the limits.

Next, using the psd function, a periodogram based on 100 samples was computed, from
which the cumulative periodogram $\tilde{I}(k)$ was obtained and plotted as a function of the normalized
frequency, as shown in Figure 9.2. The confidence limits are also shown. The computed
cumulative periodogram is a monotonic increasing function lying within the limits.

Finally, using the durbin function, the PACS sequence $\{k_l\}$ was computed from the estimated
correlations and plotted in Figure 9.2. Again all the values for lags $l > 1$ are within the
confidence limits. Thus all three tests suggest that the 100-point data are almost surely from a
white noise sequence.


FIGURE 9.2
Validation tests on white Gaussian noise in Example 9.1.1.

Although the whiteness of the residuals is a good test for model fitting, it does not
provide a definite answer to the problem. Some additional procedures include checking
whether

• The criterion of performance decreases (fast enough) as we increase the order of the
model.
• The estimate of the variance of the residual decreases as the number N of observations
increases.
• Some estimated parameters that have physical meaning (e.g., reflection coefficients)
assume values that make sense.
• The estimated parameters have sufficient accuracy for the intended application.


Finally, to demonstrate that the model is sufficiently accurate for the purpose for which it 
was designed, we can use a method known as cross-validation. Basically, in cross-validation 
we use one set of data to fit the model and another, statistically independent set of data to 
test it. Cross-validation is of paramount importance when we build models for control, fore- 
casting, and pattern recognition (Ljung 1987). However, in signal processing applications, 
such as spectral estimation and signal compression, where the goal is to provide a good fit 
of the model to the analyzed data, cross-validation is not as useful. 


9.2 ESTIMATION OF ALL-POLE MODELS 


We next use the principle of least squares to estimate parameters of all-pole signal models 
assuming both white and periodic excitations. We also discuss criteria for model order selec- 
tion, techniques for estimation of all-pole lattice parameters, and the relationship between 
all-pole estimation methods using the methods of least squares and maximum entropy. The 
relationship between all-pole model estimation and minimum-variance spectral estimation 
is explored in Section 9.5. 


9.2.1 Direct Structures 


Consider the AR($P_0$) model, where we use $a_k^*$ instead of $a_k$ to comply with Chapter 8
notation,

$$x(n) = -\sum_{k=1}^{P_0} a_k^* x(n-k) + w(n)$$    (9.2.1)

where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$. The Pth-order linear predictor of x(n) is given by

$$\hat{x}(n) = -\sum_{k=1}^{P} \hat{a}_k^* x(n-k)$$    (9.2.2)

and the corresponding prediction error sequence is

$$e(n) = x(n) - \hat{x}(n) = x(n) + \sum_{k=1}^{P} \hat{a}_k^* x(n-k)$$    (9.2.3)

$$\phantom{e(n)} = \hat{\mathbf{a}}^H \mathbf{x}(n)$$    (9.2.4)

where $\hat{a}_0 = 1$ and

$$\hat{\mathbf{a}} = [1 \;\; \hat{a}_1 \;\; \cdots \;\; \hat{a}_P]^T$$    (9.2.5)

$$\mathbf{x}(n) = [x(n) \;\; x(n-1) \;\; \cdots \;\; x(n-P)]^T$$    (9.2.6)

Thus the error over the range $N_i \le n \le N_f$ can be expressed as a vector

$$\mathbf{e} = \bar{\mathbf{X}}\hat{\mathbf{a}}$$    (9.2.7)


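where $\bar{\mathbf{X}}$ is the data matrix defined in (8.4.3). For the full-windowing case, the data matrix
$\bar{\mathbf{X}}$ is given by

$$\bar{\mathbf{X}}^H = \begin{bmatrix} x(0) & x(1) & \cdots & x(P) & \cdots & 0 & \cdots & 0 \\ 0 & x(0) & \cdots & x(P-1) & \cdots & x(N-1) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x(0) & \cdots & x(N-P) & \cdots & x(N-1) \end{bmatrix}$$    (9.2.8)

while for the no-windowing case the data matrix $\bar{\mathbf{X}}$ is

$$\bar{\mathbf{X}}^H = \begin{bmatrix} x(P) & x(P+1) & \cdots & x(N-2) & x(N-1) \\ x(P-1) & x(P) & \cdots & x(N-3) & x(N-2) \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x(0) & x(1) & \cdots & x(N-P-2) & x(N-P-1) \end{bmatrix}$$    (9.2.9)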


Notice that if $P = P_0$ and $\hat{a}_k = a_k$, the prediction error e(n) is identical to the white noise
excitation w(n). Furthermore, if AR($P_0$) is minimum-phase, then w(n) is the innovation
process of x(n) and $\hat{x}(n)$ is the MMSE prediction of x(n). Thus, we can obtain a good
estimate of the model parameters by minimizing some function of the prediction error.

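In theory, we minimize the MSE $E\{|e(n)|^2\}$. In practice, since this is not possible, we
estimate $\{\hat{a}_k\}_1^P$ for a given P by minimizing the total squared error

$$E_P = \sum_{n=N_i}^{N_f} |e(n)|^2 = \sum_{n=N_i}^{N_f} \Big| x(n) + \sum_{k=1}^{P} \hat{a}_k^* x(n-k) \Big|^2$$    (9.2.10)

$$\phantom{E_P} = \sum_{n=N_i}^{N_f} |\hat{\mathbf{a}}^H \mathbf{x}(n)|^2 = \hat{\mathbf{a}}^H (\bar{\mathbf{X}}^H \bar{\mathbf{X}}) \hat{\mathbf{a}}$$    (9.2.11)

over the range $N_i \le n \le N_f$. Hence, we can use the methods discussed in Section 8.4 for the
computation of LS linear predictors. In particular, the forward linear predictor coefficients
$\{\hat{a}_k\}_1^P$ and the associated LS error $E_P$ are obtained by solving the normal equations

$$(\bar{\mathbf{X}}^H \bar{\mathbf{X}})\,\hat{\mathbf{a}} = \begin{bmatrix} E_P \\ \mathbf{0} \end{bmatrix}$$    (9.2.12)

The solution of (9.2.12) is discussed extensively in Chapter 8.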


The least-squares AP(P) parameter estimates have properties similar to those of linear 
prediction. For example, if the process w(n) is Gaussian, the least-squares no-windowing 
estimates are also maximum-likelihood estimates (Jenkins and Watts 1968). The variance
of the excitation process can be obtained from the LS error $E_P$ by

$$\hat{\sigma}_w^2 = \frac{E_P}{N+P} = \frac{1}{N+P} \sum_{n=0}^{N+P-1} |e(n)|^2 \qquad \text{full windowing}$$    (9.2.13)

or

$$\hat{\sigma}_w^2 = \frac{E_P}{N-P} = \frac{1}{N-P} \sum_{n=P}^{N-1} |e(n)|^2 \qquad \text{no windowing}$$    (9.2.14)

for the full-windowing or no-windowing methods, respectively. Furthermore, in the full- 
windowing case, if the Toeplitz correlation matrix is positive definite, the obtained model 
is guaranteed to be minimum-phase (see Section 7.4). MATLAB functions 


[ahat,e,V] = arwin(x,P) and [ahat,e,V] = arls(x,P) 


are provided that compute the model parameters, the error sequence, and the modeling error 
using the full-windowing and no-windowing methods, respectively. 
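To make the no-windowing procedure concrete, here is a minimal Python sketch for real-valued data. It is not the arls function mentioned above: the names solve and ar_ls_nowindow, the use of plain Gaussian elimination, and the simulated AR(2) test signal are all assumptions of this example. It solves the normal equations for $\hat{a}_1, \ldots, \hat{a}_P$ over $P \le n \le N-1$ and estimates the excitation variance as in (9.2.14).

```python
import random

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def ar_ls_nowindow(x, P):
    """No-windowing LS fit of x(n) = -sum_k a_k x(n-k) + w(n), real data.

    Returns (a, e, var) with a = [1, a_1, ..., a_P], the residuals e(n)
    for P <= n <= N-1, and the variance estimate E_P / (N - P).
    """
    N = len(x)

    def phi(i, k):
        # Entry (i, k) of the data correlation matrix over P <= n <= N-1.
        return sum(x[n - i] * x[n - k] for n in range(P, N))

    A = [[phi(i, k) for k in range(1, P + 1)] for i in range(1, P + 1)]
    b = [-phi(i, 0) for i in range(1, P + 1)]
    a = [1.0] + solve(A, b)
    e = [sum(a[k] * x[n - k] for k in range(P + 1)) for n in range(P, N)]
    var = sum(v * v for v in e) / (N - P)
    return a, e, var

# Simulated AR(2) data: x(n) = 0.9 x(n-1) - 0.5 x(n-2) + w(n), unit-variance w(n).
random.seed(1)
x = [0.0, 0.0]
for _ in range(2200):
    x.append(0.9 * x[-1] - 0.5 * x[-2] + random.gauss(0.0, 1.0))
x = x[200:]                      # discard the start-up transient

a_hat, e, var_hat = ar_ls_nowindow(x, 2)
print("estimated a:", [round(v, 3) for v in a_hat])
print("estimated excitation variance:", round(var_hat, 3))
```

With the sign convention of (9.2.1), the true parameters of this simulated process are $a_1 = -0.9$ and $a_2 = 0.5$, so the estimates should be close to these values for a record of this length.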

We present three examples below to illustrate the all-pole model determination and 
its use in PSD estimation. The first example uses real data consisting of water-level mea- 
surements of Lake Huron from 1875 to 1972. The second example also uses real data 
containing sunspot numbers for 1770 through 1869. These sunspot numbers have an ap- 
proximate cycle of period around 10 to 12 years. The Lake Huron and sunspot data are 
shown in Figure 9.3. The third example generates simulated AR(4) data to estimate model
parameters and through them the PSD values. In each case, the mean was computed and
removed from the data prior to processing.

FIGURE 9.3
The Lake Huron and sunspot data used in Examples 9.2.1 and 9.2.2.


EXAMPLE 9.2.1. A careful examination of Lake Huron water-level measurement data indicates
that a low-order all-pole model might be a suitable representation of the data. To test this
hypothesis, first- and second-order models were considered. Using the full-windowing method,
the model parameters were estimated as

First-order:   $a_1 = -0.791$, $\hat{\sigma}_w^2 = 0.5024$
Second-order:  $a_1 = -1.002$, $a_2 = 0.2832$, $\hat{\sigma}_w^2 = 0.4460$

Using these model parameters, the data were filtered and the residuals were computed. Three
tests for checking the whiteness of the residuals as described in Section 9.1 were performed to
ascertain the validity of models. In Figure 9.4, we show the residuals, the autocorrelation test,
the PSD test, and the partial correlation test for the first-order model. The partial correlation test
indicates that the PACS coefficient at lag 1 is outside the confidence limits and thus the first-order
model is a poor fit. In Figure 9.5 we show the same plots for the second-order model. Clearly,
these tests show that the residuals are approximately white. Therefore, the AR(2) model appears
to be a good match to the data.

FIGURE 9.4
Validation tests on the first-order model fit to the Lake Huron water-level measurement data in
Example 9.2.1.


EXAMPLE 9.2.2. Figure 9.6 shows the PACS coefficients of the sunspot numbers along with 
the 95 percent confidence limits. Since all PACS values beyond lag 2 fall well inside the limits, 
a second-order model is a possible candidate for the data. Therefore, the second-order model 
parameters were estimated from the data to obtain the model 


$$x(n) = 1.318\,x(n-1) - 0.634\,x(n-2) + w(n) \qquad \hat{\sigma}_w^2 = 289.2$$


In Figure 9.7 we show the residuals obtained by filtering the data along with three tests for its 
whiteness. The plots show that the estimated model is a reasonable fit to the data. Finally, in 
Figure 9.8 we show the PSD estimated from the AR(2) model as well as from the periodogram. 
The periodogram is very noisy and is devoid of any structure. The AR(2) spectrum is smoother 
and distinctly shows a peak at 0.1 cycle per sampling interval. Since the sampling interval is
1 year, the peak corresponds to 10 years per cycle, which agrees with the
observations. Thus the parametric approach to PSD estimation was appropriate.


EXAMPLE 9.2.3. We illustrate the least-squares algorithms described above, using the AR(4)
process x(n) introduced in Example 5.3.2. The system function of the model is given by

$$H(z) = \frac{1}{1 - 2.7607z^{-1} + 3.8106z^{-2} - 2.6535z^{-3} + 0.9238z^{-4}}$$

and the excitation is a zero-mean Gaussian white noise with unit variance. Suppose that we are
given the N = 250 samples of x(n) shown in Figure 9.9 and we wish to model the underlying
process by using an all-pole model. To identify a candidate model, we compute the autocorrelation,
partial autocorrelation, and periodogram, using the available data. Careful inspection
of Figure 9.9 and the signal model characteristics given in Table 4.1 suggests an AR model.
Since the PACS plot cuts off around P = 5, we choose P = 4 and fit an AR(4) model to the
data, using both the full-windowing and no-windowing methods. Figure 9.10 shows the actual
spectrum of the process, the spectra of the estimated models, and the periodogram. Clearly, the
no-windowing estimate provides a better fit because it does not impose any windowing on the
data. Figure 9.11 shows the residual, autocorrelation, partial autocorrelation, and periodogram
for the no-windowing-based model. We see that the residuals can be assumed uncorrelated
with reasonable confidence, which implies that the model captures the second-order statistics of
the data.

FIGURE 9.5
Validation tests on the second-order model fit to the Lake Huron water-level measurement data in
Example 9.2.1.

FIGURE 9.6
The PACS values of the sunspot numbers in Example 9.2.2.


Modified covariance method. The LS method described above to estimate model
parameters uses the forward linear predictor and prediction error. There is also another
approach that is based on the backward linear predictor. Recall that the backward linear
predictor derived from the known correlations is the complex conjugate of the forward
predictor (and likewise, the corresponding errors are identical). However, the LS estimators
and errors based on the actual data are different because the data read in each direction are
different from a statistical viewpoint. Hence, it is much more reasonable to consider both
forward and backward predictors and to minimize the combined error

$$E_P^{fb} = \sum_{n=N_i}^{N_f} \{ |e^f(n)|^2 + |e^b(n)|^2 \} = \sum_{n=N_i}^{N_f} \{ |\hat{\mathbf{a}}^H \mathbf{x}(n)|^2 + |\hat{\mathbf{a}}^T \mathbf{J}\mathbf{x}(n)|^2 \}$$    (9.2.15)

$$\phantom{E_P^{fb}} = \hat{\mathbf{a}}^H \bar{\mathbf{X}}^H \bar{\mathbf{X}} \hat{\mathbf{a}} + \hat{\mathbf{a}}^T \mathbf{J}\bar{\mathbf{X}}^H \bar{\mathbf{X}} \mathbf{J} \hat{\mathbf{a}}^*$$

subject to the constraint that the first component of $\hat{\mathbf{a}}$ is 1, where $\mathbf{J}$ denotes the exchange
matrix. The minimization of $E_P^{fb}$ leads to the set of normal equations

$$(\bar{\mathbf{X}}^H \bar{\mathbf{X}} + \mathbf{J}\bar{\mathbf{X}}^T \bar{\mathbf{X}}^* \mathbf{J})\,\hat{\mathbf{a}} = \begin{bmatrix} E_P^{fb} \\ \mathbf{0} \end{bmatrix}$$    (9.2.16)

which can be solved efficiently to obtain the model parameters (see Section 8.4.2). This
method of using the forward-backward predictors is called the modified covariance method.
Not only does it have the advantage of minimizing the combined global error, but also since
it uses more data in (9.2.16), it gives better estimates and lower error. A similar minimization
approach, but implemented at each local stage, is used in Burg’s method, which is discussed
in Section 9.2.2.
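The forward-backward idea can be illustrated with a short Python sketch for real-valued data, in which the backward error uses the same coefficients applied to the data read in the reverse direction. The names solve and ar_fbls are hypothetical, and the solver is plain Gaussian elimination rather than the efficient algorithm of Section 8.4.2. The test signal is a pure sinusoid, chosen because it obeys the second-order recursion $x(n) - 2\cos(\omega_0)x(n-1) + x(n-2) = 0$ exactly, so that both errors can be driven to zero.

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def ar_fbls(x, P):
    """Forward-backward LS fit for real data (no windowing).

    Forward error:  e_f(n) = sum_k a_k x(n-k)
    Backward error: e_b(n) = sum_k a_k x(n-P+k),  with a_0 = 1.
    Returns (a, E_fb), where E_fb is the combined squared error.
    """
    N = len(x)

    def phi(i, k):
        fwd = sum(x[n - i] * x[n - k] for n in range(P, N))
        bwd = sum(x[n - P + i] * x[n - P + k] for n in range(P, N))
        return fwd + bwd

    A = [[phi(i, k) for k in range(1, P + 1)] for i in range(1, P + 1)]
    rhs = [-phi(i, 0) for i in range(1, P + 1)]
    a = [1.0] + solve(A, rhs)
    E_fb = sum(a[i] * a[k] * phi(i, k)
               for i in range(P + 1) for k in range(P + 1))
    return a, E_fb

w0 = 0.3
x = [math.cos(w0 * n + 0.4) for n in range(50)]
a_fb, E_fb = ar_fbls(x, 2)
print("estimated a:", a_fb)
print("expected a: ", [1.0, -2.0 * math.cos(w0), 1.0])
print("combined error:", E_fb)
```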
FIGURE 9.7
Validation tests on the second-order model fit to the sunspot numbers in Example 9.2.2.

FIGURE 9.8
Comparison of the periodogram and the AR(2) spectrum in Example 9.2.2.

FIGURE 9.9
Data segment from an AR(4) process, and the corresponding autocorrelation, partial
autocorrelation, and periodogram. Panels: AR(4) data versus sample number; autocorrelation
versus lag l; PACS versus m; periodogram versus frequency (cycles/sampling interval).


Frequency-domain interpretation. In the case of full windowing, by using Parseval’s
theorem, the error energy can be written as

$$E = \sum_{n=-\infty}^{\infty} |e(n)|^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{|X(e^{j\omega})|^2}{|H(e^{j\omega})|^2}\, d\omega$$    (9.2.17)

where $|X(e^{j\omega})|^2$ is the spectrum of the modeled windowed signal segment and $H(e^{j\omega})$ is
the frequency response of the estimated all-pole model [whose squared magnitude is the
estimated spectrum of x(n)]. This expression is a good approximation for the other windowing
methods if $N \gg P$. Since the integrand in (9.2.17) is positive, minimizing the error E is
equivalent to minimizing the integrated ratio of the energy spectrum of the modeled signal
segment to its all-pole-based spectrum.

FIGURE 9.10
Periodogram, theoretical AR(4) spectrum, and AR(4) model spectra using full windowing,
Hamming windowing, and no windowing.

FIGURE 9.11
Residual sequence for the AR(4) data, and the corresponding autocorrelation, partial
autocorrelation, and periodogram.
The presence of this ratio in (9.2.17) has three additional consequences. (1) The quality
of the spectral matching is uniform over the whole frequency range, irrespective of the
shape of the spectrum. (2) Since regions where $|X(e^{j\omega})| > |H(e^{j\omega})|$ contribute more to
the total error than regions where $|X(e^{j\omega})| < |H(e^{j\omega})|$ do, the match is better near spectral
peaks than near spectral valleys. (3) The all-pole model provides a good estimate of the
envelope of the signal spectrum $|X(e^{j\omega})|^2$. These properties are apparent in Figure 9.12,
which shows a comparison between $20\log|X(e^{j\omega})|$ (obtained using the periodogram) and
$20\log|H(e^{j\omega})|$ [obtained by an AP(28) model fitted using full windowing] for a 20-ms,
Hamming windowed, speech signal sampled at 20 kHz. Note that the slope of $|H(e^{j\omega})|$ is
always zero at frequencies $\omega = 0$ and $\omega = \pi$, as expected. More details on these issues
can be found in Makhoul (1975b).

FIGURE 9.12
Illustration of the spectral envelope matching property of all-pole models.

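The error energy (9.2.17) is also related to the Itakura-Saito (IS) distortion measure,
which is given by

$$d_{IS}(R_1, R_2) = \frac{1}{2\pi} \int_{-\pi}^{\pi} [\exp V(e^{j\omega}) - V(e^{j\omega}) - 1]\, d\omega$$    (9.2.18)

where $R_1(e^{j\omega})$ and $R_2(e^{j\omega})$ are two spectra, and

$$V(e^{j\omega}) \triangleq \log R_1(e^{j\omega}) - \log R_2(e^{j\omega})$$    (9.2.19)

Indeed, we can show that

$$d_{IS}(R_1, R_2) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{R_1(e^{j\omega})}{R_2(e^{j\omega})}\, d\omega - \log\frac{\sigma_1^2}{\sigma_2^2} - 1$$    (9.2.20)

where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the innovation sequences corresponding to the
factorization of spectra $R_1(e^{j\omega})$ and $R_2(e^{j\omega})$, respectively. More details can be found in Rabiner
and Juang (1993).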


Order selection criteria. The order of an all-pole signal model plays an important role 
in the modeling problem. It determines the number of parameters to be estimated and hence 
the computational complexity of the algorithm. But more importantly, it affects the quality 
of the spectrum estimates. If the selected order is too low, then the resulting spectrum
will be smooth and will display poor resolution. If the order is too high, then the
spectrum may contain spurious peaks at best and, at worst, exhibit a phenomenon called
spectrum splitting, in which a single peak is split into two separate and distinct peaks (Hayes 1996).

Several criteria have been proposed over the years for model order selection; however, 
in practice nothing surpasses the graphical approach outlined in Examples 9.2.1 and 9.2.2 
combined with the experience of the user. Therefore, we only provide a brief summary of 
some well-known criteria and refer the interested reader to Kay (1988), Porat (1994), and 
Ljung (1987) for more details. The simplest approach would be to monitor the modeling
error and then select the order at which this error enters a steady state. However, for all-pole
models, the modeling error is monotonically decreasing, which makes this approach all but
impossible. The general idea behind the suggested criteria is to introduce a penalty function
in the modeling error that increases with the model order P. We present the following four
criteria that are based on the above general idea.


FPE criterion. The final prediction error (FPE) criterion, proposed by Akaike (1970),
is based on the function

$$\mathrm{FPE}(P) = \frac{N+P}{N-P}\, \hat{\sigma}_P^2$$    (9.2.21)

where $\hat{\sigma}_P^2$ is the modeling error [or variance of the residual of the estimated AP(P) model].
We note that the term $\hat{\sigma}_P^2$ decreases or remains the same with increasing P, whereas the
term (N + P)/(N − P) accounts for the increase in $\hat{\sigma}_P^2$ due to inaccuracies in the estimated
parameters and increases with P. Clearly, FPE(P) is an inflated version of $\hat{\sigma}_P^2$. The FPE
order selection criterion is to choose P that will minimize the function in (9.2.21).


AIC. The Akaike information criterion (AIC), also introduced by Akaike (1974), is 
based on the function 


$$\mathrm{AIC}(P) = N \log \hat{\sigma}_P^2 + 2P$$    (9.2.22)


It is a very general criterion that provides an estimate of the Kullback-Leibler distance 
(Kullback 1959) between an assumed and the true probability density function of the data. 
The performances of the FPE criterion and the AIC are quite similar. 


MDL criterion. The minimum description length (MDL) criterion was proposed by
Rissanen (1978) and uses the function

$$\mathrm{MDL}(P) = N \log \hat{\sigma}_P^2 + P \log N$$    (9.2.23)

The first term in (9.2.23) decreases with P, but the second penalty term increases. It has
been shown (Rissanen 1978) that this criterion provides a consistent order estimate in that
the probability that the estimated order is equal to the true order approaches 1 as the data
length N tends to infinity.


CAT. This criterion is based on Parzen’s criterion autoregressive transfer (CAT) function
(Parzen 1977), which is given by

$$\mathrm{CAT}(P) = \frac{1}{N} \sum_{k=1}^{P} \frac{N-k}{N\hat{\sigma}_k^2} - \frac{N-P}{N\hat{\sigma}_P^2}$$    (9.2.24)

This criterion is asymptotically equivalent to the AIC and the MDL criteria.

Basically, all order selection criteria add to the variance of the residuals a term that grows
with the order of the model and estimate the order of the model by minimizing the resulting
criterion. However, when $P \ll N$, which is the case in many practical applications, the
criterion often does not exhibit a clear minimum, which makes the order selection process
difficult (see Problem 9.1).
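The four criteria are simple functions of the residual variances, so they are easy to evaluate once the models of orders 1, 2, ... have been fitted. The Python sketch below is illustrative only: the function names are hypothetical and the variance sequence is made up for the example (it drops sharply at P = 2 and then flattens); it is not data from the book.

```python
import math

def fpe(N, P, var):
    """Final prediction error (9.2.21)."""
    return (N + P) / (N - P) * var

def aic(N, P, var):
    """Akaike information criterion (9.2.22)."""
    return N * math.log(var) + 2 * P

def mdl(N, P, var):
    """Minimum description length (9.2.23)."""
    return N * math.log(var) + P * math.log(N)

def cat(N, P, variances):
    """Criterion autoregressive transfer function (9.2.24).
    variances[k-1] is the residual variance of the order-k model."""
    s = sum((N - k) / (N * variances[k - 1]) for k in range(1, P + 1)) / N
    return s - (N - P) / (N * variances[P - 1])

def select_orders(N, variances):
    """Return the minimizing order for each criterion."""
    orders = range(1, len(variances) + 1)
    scores = {
        "FPE": [fpe(N, P, variances[P - 1]) for P in orders],
        "AIC": [aic(N, P, variances[P - 1]) for P in orders],
        "MDL": [mdl(N, P, variances[P - 1]) for P in orders],
        "CAT": [cat(N, P, variances) for P in orders],
    }
    return {name: 1 + vals.index(min(vals)) for name, vals in scores.items()}

# Hypothetical residual variances for orders P = 1, 2, 3, 4 with N = 100 samples.
N = 100
variances = [2.0, 1.0, 0.99, 0.985]
selected = select_orders(N, variances)
print(selected)
```

In this constructed case the variance barely improves beyond P = 2, so the penalty terms dominate and every criterion selects order 2. With real data the minima are often much flatter, which is the difficulty noted above.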


9.2.2 Lattice Structures 


We noted in Section 7.5 that a prediction error filter, and hence the AP model, can also be
implemented by using a lattice structure. The Pth-order forward prediction error
$e(n) = e_P^f(n)$ and the total squared error

$$E_P = \sum_{n=N_i}^{N_f} |e(n)|^2$$    (9.2.25)

are nonlinear functions of the lattice parameters $k_m$, $0 \le m \le P-1$. For example, if
P = 2, we have

$$e_2^f(n) = x(n) + (k_0^* + k_0 k_1^*)\, x(n-1) + k_1^* x(n-2)$$

which shows that $e_2^f(n)$ depends on the product $k_0 k_1^*$. Thus, fitting an all-pole lattice model
by minimizing $E_P$ with respect to $k_m$, $0 \le m \le P-1$, leads to a difficult nonlinear
optimization problem.

We can avoid this problem by replacing the above “global” optimization with P “local”
optimizations from m = 1 to P, one for each stage of the lattice. From the lattice equations

$$e_m^f(n) = e_{m-1}^f(n) + k_{m-1}^* e_{m-1}^b(n-1)$$    (9.2.26)

$$e_m^b(n) = e_{m-1}^b(n-1) + k_{m-1} e_{m-1}^f(n)$$    (9.2.27)

we see that the mth-order prediction errors depend on the coefficient $k_{m-1}$ only. Furthermore,
the values of $e_{m-1}^f(n)$ and $e_{m-1}^b(n)$ have been computed by using $k_{m-2}$, which has been
determined from the optimization step at the previous stage.

Hence, to minimize the forward prediction error

$$E_m^f = \sum_{n=N_i}^{N_f} |e_m^f(n)|^2$$    (9.2.28)

we substitute (9.2.26) into (9.2.28) and differentiate† with respect to $k_{m-1}^*$. This leads to
the following optimum value of $k_{m-1}$:

$$k_{m-1}^{FP} = -\frac{\beta_{m-1}^{fb}}{E_{m-1}^b}$$    (9.2.29)

where

$$\beta_{m-1}^{fb} = \sum_{n=N_i}^{N_f} [e_{m-1}^f(n)]^* e_{m-1}^b(n-1)$$    (9.2.30)

and

$$E_{m-1}^b = \sum_{n=N_i}^{N_f} |e_{m-1}^b(n-1)|^2$$    (9.2.31)

Similarly, minimization of the backward prediction error (9.2.31) gives

$$k_{m-1}^{BP} = -\frac{\beta_{m-1}^{fb}}{E_{m-1}^f}$$    (9.2.32)

Burg (1967) suggested the estimation of $k_{m-1}$ by minimizing

$$E_m^{fb} = \sum_{n=N_i}^{N_f} \{ |e_m^f(n)|^2 + |e_m^b(n)|^2 \}$$    (9.2.33)

at each stage of the lattice.‡ Indeed, substituting (9.2.26) and (9.2.27) in the last equation,
we obtain the relationship

$$E_m^{fb} = (1 + |k_{m-1}|^2) E_{m-1}^f + 4\,\mathrm{Re}(k_{m-1}^* \beta_{m-1}^{fb}) + (1 + |k_{m-1}|^2) E_{m-1}^b$$    (9.2.34)

If we set $\partial E_m^{fb} / \partial k_{m-1}^* = 0$, we obtain the following estimate of $k_{m-1}$:

$$k_{m-1}^{B} = -\frac{\beta_{m-1}^{fb}}{\frac{1}{2}(E_{m-1}^f + E_{m-1}^b)} = \frac{2 k_{m-1}^{FP} k_{m-1}^{BP}}{k_{m-1}^{FP} + k_{m-1}^{BP}}$$    (9.2.35)

We note that $k_{m-1}^{B}$ is the harmonic mean of $k_{m-1}^{FP}$ and $k_{m-1}^{BP}$. We also stress that the obtained
model is different from the one resulting from the forward-backward least-squares (FBLS)
method through global optimization [see (9.2.16)].

† See Appendix B for a discussion of how to find an optimum of a real-valued function of a complex variable and
its conjugate.

‡ This approach should not be confused with the maximum entropy method introduced also by Burg and discussed
later.

Itakura and Saito (1971) proposed an estimate of $k_{m-1}$ based on replacing the theoretical
ensemble averages in (7.5.24) by time averages. Their estimate is given by

$$k_{m-1}^{IS} = -\frac{\beta_{m-1}^{fb}}{\sqrt{E_{m-1}^f E_{m-1}^b}} = \mathrm{sign}(k_{m-1}^{FP} \text{ or } k_{m-1}^{BP}) \sqrt{k_{m-1}^{FP} k_{m-1}^{BP}}$$    (9.2.36)

and is also known as the geometric mean method. Since it can be shown that

$$|k_{m-1}^{B}| \le |k_{m-1}^{IS}| \le 1$$    (9.2.37)

both estimates result in minimum-phase models (see Problem 9.2). From (9.2.36) and
(9.2.37) we also see that $|k_{m-1}^{FP}|\,|k_{m-1}^{BP}| = |k_{m-1}^{IS}|^2 \le 1$; hence if $|k_{m-1}^{FP}| > 1$, then
$|k_{m-1}^{BP}| < 1$, and vice versa, so the forward and backward estimates cannot both have
magnitude greater than 1. Several other estimates
are discussed in Makhoul (1977) and Viswanathan and Makhoul (1975).

In all previous methods, we use no windowing; that is, we set $N_i = m$ and $N_f = N - 1$. If
we use data windowing, all the above estimates are identical to the data windowing estimates
obtained using the algorithm of Levinson-Durbin (see Problem 9.3).

The variance of the residuals can be estimated by

$$\hat{\sigma}_m^2 = \frac{1}{2(N-m)} E_m^{fb}$$    (9.2.38)

which for large values of N (see Problem 9.12) can be approximated by

$$\hat{\sigma}_m^2 = \hat{\sigma}_{m-1}^2 (1 - |k_{m-1}|^2)$$    (9.2.39)

where

$$\hat{\sigma}_0^2 = \frac{1}{N} \sum_{n=0}^{N-1} |x(n)|^2$$    (9.2.40)
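The stage-by-stage estimation can be sketched for real-valued data as follows. This Python fragment is an illustrative reading of Burg's estimate (9.2.35) with no windowing ($N_i = m$, $N_f = N - 1$) and of the variance recursion (9.2.39) and (9.2.40); the function name burg and the simulated AR(1) test signal are assumptions of this example, and the code is not the book's aplatest function.

```python
import random

def burg(x, P):
    """Burg estimates of the lattice parameters k_0, ..., k_{P-1} (real data).

    Returns (k, var): k[m-1] is the parameter of stage m, and var is the
    residual variance obtained from the recursion (9.2.39)-(9.2.40).
    """
    N = len(x)
    ef = list(x)                       # e^f_0(n) = x(n)
    eb = list(x)                       # e^b_0(n) = x(n)
    var = sum(v * v for v in x) / N    # sigma_0^2, see (9.2.40)
    k = []
    for m in range(1, P + 1):
        # Sums over n = m, ..., N-1 (no windowing).
        beta = sum(ef[n] * eb[n - 1] for n in range(m, N))
        Ef = sum(ef[n] ** 2 for n in range(m, N))
        Eb = sum(eb[n - 1] ** 2 for n in range(m, N))
        km = -2.0 * beta / (Ef + Eb)   # Burg estimate (9.2.35)
        k.append(km)
        # Lattice update (9.2.26)-(9.2.27); new lists keep the old values intact.
        new_ef, new_eb = ef[:], eb[:]
        for n in range(m, N):
            new_ef[n] = ef[n] + km * eb[n - 1]
            new_eb[n] = eb[n - 1] + km * ef[n]
        ef, eb = new_ef, new_eb
        var *= 1.0 - km * km           # variance recursion (9.2.39)
    return k, var

# Simulated AR(1) data: x(n) = 0.8 x(n-1) + w(n), unit-variance w(n).
random.seed(2)
x = [0.0]
for _ in range(2200):
    x.append(0.8 * x[-1] + random.gauss(0.0, 1.0))
x = x[200:]                            # discard the start-up transient

k_hat, var_hat = burg(x, 2)
print("lattice parameters:", [round(v, 3) for v in k_hat])
print("residual variance: ", round(var_hat, 3))
```

For this AR(1) process the first lattice parameter should be close to −0.8 and the second close to zero, and every estimate satisfies $|k_{m-1}| \le 1$, in agreement with (9.2.37).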


The computations for the lattice estimation methods are summarized in Table 9.1, and the
algorithms are implemented by the function [k, var] = aplatest(x,P).

TABLE 9.1
Algorithm for estimation of AP lattice parameters.

1. Input: x(n) for $N_i \le n \le N_f$
2. Initialization
   a. $e_0^f(n) = e_0^b(n) = x(n)$.
   b. Compute $\beta_0^{fb}$, $E_0^f$, and $E_0^b$ from x(n).
   c. Compute $k_0^{FP}$ and $k_0^{BP}$.
   d. Compute either $k_0^{IS}$ or $k_0^{B}$ from $k_0^{FP}$ and $k_0^{BP}$.
   e. Apply the first stage of the lattice to x(n) using either $k_0^{IS}$ or $k_0^{B}$ to obtain $e_1^f(n)$ and $e_1^b(n)$.
3. For m = 2, 3, ..., P
   a. Compute $\beta_{m-1}^{fb}$, $E_{m-1}^f$, and $E_{m-1}^b$ from $e_{m-1}^f(n)$ and $e_{m-1}^b(n)$.
   b. Compute $k_{m-1}^{FP}$ and $k_{m-1}^{BP}$.
   c. Compute either $k_{m-1}^{IS}$ or $k_{m-1}^{B}$ from $k_{m-1}^{FP}$ and $k_{m-1}^{BP}$.
   d. Apply the mth stage of the lattice to $e_{m-1}^f(n)$ and $e_{m-1}^b(n)$ using either $k_{m-1}^{IS}$ or $k_{m-1}^{B}$ to obtain $e_m^f(n)$ and $e_m^b(n)$.
4. Output: Either $k_{m-1}^{IS}$ or $k_{m-1}^{B}$ for m = 1, 2, ..., P and $e_P^f(n)$ and $e_P^b(n)$.


9.2.3 Maximum Entropy Method

We next show how LS all-pole modeling is related to Burg’s method of maximum entropy.
To this end, suppose that x(n) is a normal, stationary process with zero mean. The
M-dimensional complex-valued vector $\mathbf{x} = \mathbf{x}(n)$ obeys a normal distribution

$$p(\mathbf{x}) = \frac{1}{\pi^M \det \mathbf{R}} \exp(-\mathbf{x}^H \mathbf{R}^{-1} \mathbf{x})$$    (9.2.41)

where $\mathbf{R}$ is a Toeplitz correlation matrix. By definition, its entropy is given by

$$H(\mathbf{x}) = -E\{\log p(\mathbf{x})\} = M \log \pi + \log(\det \mathbf{R}) + M$$    (9.2.42)

because $E\{\mathbf{x}^H \mathbf{R}^{-1} \mathbf{x}\} = M$. If the process x(n) is regular, that is, $|k_m| < 1$ for all m, we
have

$$\det \mathbf{R} = \prod_{m=0}^{M-1} P_m \qquad \text{and} \qquad P_m = r(0) \prod_{j=1}^{m} (1 - |k_j|^2)$$    (9.2.43)

where $P_m = P_m^f = P_m^b$ (see Section 7.4). If we substitute (9.2.43) into (9.2.42), we obtain

$$H(\mathbf{x}) = M \log \pi + M + M \log r(0) + \sum_{m=1}^{M-1} (M - m) \log(1 - |k_m|^2)$$    (9.2.44)

which expresses the entropy in terms of r(0) and the PACS $k_m$, $1 \le m < M < \infty$ [recall that
any parametric model can be specified by r(0) and the PACS]. Suppose now that we are given
the first P + 1 values r(0), r(1), ..., r(P) of the autocorrelation sequence and we wish to
find a model, by choosing the remaining values r(l), l > P, so that the entropy is maximized.
From (9.2.44), we see that the entropy is maximized if we choose $k_m = 0$ for m > P, that
is, by modeling the process x(n) by an AR(P) model. In conclusion, among all regular
Gaussian processes with the same first P + 1 autocorrelation values, the AR(P) process has
the maximum entropy. Any other choices for $k_m$, m > P, that satisfy the condition $|k_m| < 1$
lead to a valid extension of the autocorrelation sequence. The “extended” values r(l), l > P,
can be obtained by using the inverse Levinson-Durbin or the inverse Schur algorithm (see
Chapter 7). The relation between autoregressive modeling and the principle of maximum
entropy, known as the maximum entropy method, was introduced by Burg (1967, 1975). We
note that the above proof, given in Porat (1994), is different from the original proof provided
by Burg (Burg 1975; Therrien 1992). An interesting discussion of various arguments in favor
of and against the maximum entropy method can be found in Makhoul (1986).
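The maximum entropy extension can be illustrated numerically for real-valued autocorrelations. In the Python sketch below, the Levinson-Durbin recursion yields the AR(P) coefficients from r(0), ..., r(P), and setting $k_m = 0$ for m > P amounts to extending the sequence with the AR recursion $r(l) = -\sum_{k=1}^{P} a_k r(l-k)$. The function names and the numerical values are assumptions chosen for this example.

```python
def levinson(r, P):
    """Levinson-Durbin recursion for real autocorrelations r(0..P).

    Returns (a, ks, E): a = [1, a_1, ..., a_P], ks[m-1] is the parameter
    computed at stage m, and E is the final prediction error power.
    """
    a = [1.0]
    E = r[0]
    ks = []
    for m in range(1, P + 1):
        acc = sum(a[i] * r[m - i] for i in range(m))
        k = -acc / E
        a = [1.0] + [a[i] + k * a[m - i] for i in range(1, m)] + [k]
        E *= 1.0 - k * k
        ks.append(k)
    return a, ks, E

def max_entropy_extension(r, P, L):
    """Extend r(0..P) up to lag L by choosing k_m = 0 for m > P,
    that is, by running the AR(P) recursion on the autocorrelation."""
    a, _, _ = levinson(r, P)
    ext = list(r[:P + 1])
    for l in range(P + 1, L + 1):
        ext.append(-sum(a[k] * ext[l - k] for k in range(1, P + 1)))
    return ext

# Given r(0) = 1 and r(1) = 0.5, the AR(1) extension is r(l) = 0.5**l.
ext1 = max_entropy_extension([1.0, 0.5], 1, 5)
print(ext1)

# Given r(0) = 1, r(1) = 0.5, r(2) = 0.4, the AR(2) extension continues the sequence.
a2, ks2, E2 = levinson([1.0, 0.5, 0.4], 2)
ext2 = max_entropy_extension([1.0, 0.5, 0.4], 2, 4)
print(a2, ks2, ext2)
```

Any other choice of the parameters beyond stage P with magnitude less than 1 would also give a valid extension, but with lower entropy according to (9.2.44).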


9.2.4 Excitations with Line Spectra 


When the excitation of a parametric model has a spectrum with lines at L frequencies $\omega_m$,
the spectrum of the output signal provides information about the frequency response of the
model at these frequencies only. For simplicity, assume equidistant samples at frequencies
$\omega_m = 2\pi m/L$, $0 \le m \le L-1$. Given a set of values $R_x(e^{j\omega_m}) = |X(e^{j\omega_m})|^2$, we wish to
find an AP(P) model whose spectrum $R_h(e^{j\omega})$ matches $R_x(e^{j\omega_m})$ at the given frequencies,
by minimizing the criterion

$$E = \frac{d_0^2}{L} \sum_{m=1}^{L} \frac{R_x(e^{j\omega_m})}{R_h(e^{j\omega_m})}$$    (9.2.45)

which is the discrete version of (9.2.17) and $d_0$ is the gain of the model (see Section 4.2).
The minimization of (9.2.45) with respect to the model parameters $\{a_k\}$ results in the
Yule-Walker equations

$$\sum_{k=0}^{P} a_k^* \tilde{r}(i-k) = \begin{cases} E & i = 0 \\ 0 & 1 \le i \le P \end{cases}$$    (9.2.46)

where

$$\tilde{r}(l) = \frac{1}{L} \sum_{m=1}^{L} R_x(e^{j\omega_m})\, e^{j\omega_m l}$$    (9.2.47)

For continuous spectra, linear prediction uses the autocorrelation

$$r(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R_x(e^{j\omega})\, e^{j\omega l}\, d\omega$$    (9.2.48)

which is related to $\tilde{r}(l)$ by

$$\tilde{r}(l) = \sum_{m=-\infty}^{\infty} r(l - Lm)$$    (9.2.49)

that is, $\tilde{r}(l)$ is an aliased version of r(l). We have seen that linear prediction equates the
autocorrelation of the AP(P) model to the autocorrelation of the modeled signal for the
first P + 1 lags. Hence, when we use linear prediction for a signal with line spectra,
the autocorrelation of the all-pole model will be matched to $\tilde{r}(l) \ne r(l)$ and will always
result in a model different from the original. Clearly, the correlation matching condition
cannot compensate for the autocorrelation aliasing, which becomes more pronounced as L
decreases. This phenomenon, which is severe for voiced sounds with high pitch, is illustrated
in Problem 9.13. A method that provides better estimates, by minimizing a discrete version
of the Itakura-Saito error measure, has been developed for both AP and PZ models by
El-Jaroudi and Makhoul (1991, 1989).
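The aliasing relation (9.2.49) is easy to check numerically. The Python sketch below uses, as an assumed test case, a spectrum whose autocorrelation is $r(l) = a^{|l|}$, samples it at L equidistant frequencies, and compares the autocorrelation obtained from the samples via (9.2.47) with the sum of shifted copies in (9.2.49). The function names and the parameter values are hypothetical choices for this illustration.

```python
import cmath
import math

def spectrum(a, w):
    """Spectrum whose autocorrelation is r(l) = a**|l| (|a| < 1)."""
    return (1.0 - a * a) / (1.0 - 2.0 * a * math.cos(w) + a * a)

def sampled_autocorr(a, L, l):
    """Autocorrelation from L equidistant spectrum samples, as in (9.2.47)."""
    s = sum(spectrum(a, 2.0 * math.pi * m / L)
            * cmath.exp(2j * math.pi * m * l / L) for m in range(L))
    return (s / L).real

def aliased_autocorr(a, L, l, terms=200):
    """Sum of shifted copies r(l - L m), as in (9.2.49), truncated to |m| <= terms."""
    return sum(a ** abs(l - L * m) for m in range(-terms, terms + 1))

a = 0.9
for L in (8, 64):
    r1_true = a ** 1
    r1_samp = sampled_autocorr(a, L, 1)
    r1_alias = aliased_autocorr(a, L, 1)
    print(f"L = {L:2d}: r(1) = {r1_true:.4f}, from samples = {r1_samp:.4f}, "
          f"aliased sum = {r1_alias:.4f}")
```

With only L = 8 spectral lines, the autocorrelation recovered from the samples differs markedly from r(l), whereas with L = 64 lines the two nearly coincide, which is the sense in which aliasing becomes more pronounced as L decreases.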


9.3 ESTIMATION OF POLE-ZERO MODELS 


The estimation of PZ(P, Q) model parameters for $Q \ne 0$ leads to a nonlinear LS
optimization problem. As a result, a vast number of suboptimum methods, with reduced
computational complexity, have been developed to avoid this problem. For example, some
techniques estimate the AP(P) and AZ(Q) parameters separately. However, the availability
of high-speed computers has made exact least-squares the method of choice. Since
nonlinear LS optimization with respect to a complex vector and its conjugate is inherently
difficult, and since this optimization does not provide any additional insight into the
solution technique, we assume, in this section, that the quantities are real-valued. Furthermore,
most real-world applications of pole-zero models involve real-valued
signals and systems. The extension to the complex-valued case is straightforward.
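Consider the PZ(P, Q) model

$$x(n) = -\sum_{k=1}^{P} a_k x(n-k) + w(n) + \sum_{k=1}^{Q} d_k w(n-k) \tag{9.3.1}$$

where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$. Using vector notation, we can express (9.3.1) as

$$x(n) = \mathbf{z}^T(n-1)\,\mathbf{c}_{pz} + w(n) \tag{9.3.2}$$

where

$$\mathbf{z}(n) \triangleq [-x(n)\ \cdots\ -x(n-P+1)\ \ w(n)\ \cdots\ w(n-Q+1)]^T \tag{9.3.3}$$

and

$$\mathbf{c}_{pz} = [\mathbf{a}^T\ \mathbf{d}^T]^T = [a_1\ \cdots\ a_P\ \ d_1\ \cdots\ d_Q]^T \tag{9.3.4}$$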


9.3.1 Known Excitation 


Assume for a moment that the excitation w(n) is known. Then we can predict x(n) from
past values, using the following linear predictor

$$\hat{x}(n) = \mathbf{z}^T(n-1)\,\mathbf{c} \tag{9.3.5}$$

where

$$\mathbf{c} = [\hat{a}_1\ \cdots\ \hat{a}_P\ \ \hat{d}_1\ \cdots\ \hat{d}_Q]^T \tag{9.3.6}$$

are the predictor parameters. The prediction error

$$e(n) = x(n) - \hat{x}(n) = x(n) - \mathbf{z}^T(n-1)\,\mathbf{c} \tag{9.3.7}$$

equals w(n) if $\mathbf{c} = \mathbf{c}_{pz}$. Minimization of the total squared error

$$\mathcal{E}(\mathbf{c}) \triangleq \sum_{n=N_i}^{N_f} e^2(n) \tag{9.3.8}$$

leads to the following linear system of equations

$$\hat{\mathbf{R}}_z\,\mathbf{c} = \hat{\mathbf{r}}_z \tag{9.3.9}$$

where

$$\hat{\mathbf{R}}_z = \sum_{n=N_i}^{N_f} \mathbf{z}(n-1)\,\mathbf{z}^T(n-1) \tag{9.3.10}$$

and

$$\hat{\mathbf{r}}_z = \sum_{n=N_i}^{N_f} \mathbf{z}(n-1)\,x(n) \tag{9.3.11}$$

Usually, we use residual windowing, which implies that $N_i = \max(P, Q)$ and $N_f = N - 1$.
Since the matrix $\hat{\mathbf{R}}_z$ is symmetric and positive semidefinite, we can solve (9.3.9) using $LDL^T$
decomposition. Thus, if we know the excitation w(n), the least-squares estimation of the
PZ(P, Q) model parameters reduces to the solution of a linear system of equations. An
estimate of the input variance is given by

$$\hat{\sigma}_w^2 = \frac{1}{N - \max(P, Q)} \sum_{n=\max(P,Q)}^{N-1} e^2(n) \tag{9.3.12}$$


This method, which is implemented by the function pz1ls.m, is known as the equation-error 
method and can be used to identify a system from input-output data (Ljung 1987) (see 
Problem 9.14). 
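
The equation-error method above can be sketched in a few lines. This is an illustrative sketch, not the pz1ls.m function itself: the function names, the simulated PZ(2, 1) coefficients, and the random seed are assumptions, and plain Gaussian elimination is swapped in for the $LDL^T$ decomposition mentioned in the text.

```python
import random

def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][c] * y[c] for c in range(i + 1, n))
        y[i] = s / aug[i][i]
    return y

def regressor(x, w, n, P, Q):
    """z(n-1) = [-x(n-1) ... -x(n-P)  w(n-1) ... w(n-Q)], cf. (9.3.3)."""
    return ([-x[n - k] for k in range(1, P + 1)]
            + [w[n - k] for k in range(1, Q + 1)])

def pz_ls_known_excitation(x, w, P, Q):
    """Build (9.3.10)-(9.3.11) with residual windowing and solve (9.3.9)."""
    N, dim, n_i = len(x), P + Q, max(P, Q)
    R = [[0.0] * dim for _ in range(dim)]
    r = [0.0] * dim
    for n in range(n_i, N):
        z = regressor(x, w, n, P, Q)
        for i in range(dim):
            r[i] += z[i] * x[n]
            for j in range(dim):
                R[i][j] += z[i] * z[j]
    c = solve(R, r)
    err = [x[n] - sum(zi * ci for zi, ci in zip(regressor(x, w, n, P, Q), c))
           for n in range(n_i, N)]
    var_w = sum(e * e for e in err) / (N - n_i)      # (9.3.12)
    return c, err, var_w

# Simulate a PZ(2, 1) process following (9.3.1); the coefficients are made up.
random.seed(0)
a_true, d_true = [-1.2, 0.6], [0.5]
N = 400
w = [random.gauss(0.0, 1.0) for _ in range(N)]
x = [0.0] * N
for n in range(N):
    x[n] = w[n]
    x[n] -= sum(a_true[k - 1] * x[n - k] for k in (1, 2) if n - k >= 0)
    x[n] += sum(d_true[k - 1] * w[n - k] for k in (1,) if n - k >= 0)

c_hat, err, var_w = pz_ls_known_excitation(x, w, P=2, Q=1)
print(c_hat, var_w)
```

Because the excitation is known, the fit is a single linear solve; the returned residual is orthogonal to every regressor, which is exactly the normal-equation condition (9.3.9).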


9.3.2 Unknown Excitation 
In most applications, the excitation w(n) is not known. However, we can obtain a good
estimate of x(n) by replacing w(n) by e(n) in (9.3.3). This is a natural choice if the
model used to obtain e(n) is reasonably accurate. The prediction error is then given by

$$e(n) = x(n) - \hat{x}(n) = x(n) - \hat{\mathbf{z}}^T(n-1)\,\mathbf{c} \tag{9.3.13}$$

where

$$\hat{\mathbf{z}}(n) = [-x(n)\ \cdots\ -x(n-P+1)\ \ e(n)\ \cdots\ e(n-Q+1)]^T \tag{9.3.14}$$

If we write (9.3.13) explicitly

$$e(n) = -\sum_{k=1}^{Q} \hat{d}_k e(n-k) + x(n) + \sum_{k=1}^{P} \hat{a}_k x(n-k) \tag{9.3.15}$$

we see that the prediction error is obtained by exciting the inverse model with the signal
x(n). Hence, the inverse model has to be stable. To satisfy this condition, we require the
estimated model to be minimum-phase.
The recursive computation of e(n) by (9.3.15) makes the prediction error a nonlinear
function of the model parameters. To illustrate this, consider the prediction error for a
first-order model, that is, for P = Q = 1

$$e(n) = x(n) + \hat{a}_1 x(n-1) - \hat{d}_1 e(n-1)$$

Assuming e(0) = 0, we have for n = 1, 2, 3

$$\begin{aligned}
e(1) &= x(1) + \hat{a}_1 x(0) \\
e(2) &= x(2) + \hat{a}_1 x(1) - \hat{d}_1 e(1) \\
     &= x(2) + (\hat{a}_1 - \hat{d}_1)\,x(1) - \hat{a}_1 \hat{d}_1\, x(0) \\
e(3) &= x(3) + \hat{a}_1 x(2) - \hat{d}_1 e(2) \\
     &= x(3) + (\hat{a}_1 - \hat{d}_1)\,x(2) - (\hat{a}_1 - \hat{d}_1)\hat{d}_1\, x(1) + \hat{a}_1 \hat{d}_1^2\, x(0)
\end{aligned}$$

which shows that e(n) is a nonlinear function of the model parameters if $Q \ne 0$. Thus, the
total squared error

$$\mathcal{E}(\mathbf{c}) = \sum_{n=N_i}^{N_f} e^2(n) \tag{9.3.16}$$

expressed in terms of the signal values x(0), x(1), ..., x(N − 1), is a nonquadratic function
of the model parameters. Sometimes, $\mathcal{E}(\mathbf{c})$ has several local minima. The model parameters
can be obtained by minimizing the total squared error using nonlinear optimization
techniques. 
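
The first-order expansions above can be reproduced by running the recursion (9.3.15) directly. The sketch below is illustrative only: the sample values, the parameter values, and the function name are arbitrary assumptions, and e(0) is forced to zero to mirror the assumption made in the illustration.

```python
def prediction_error(x, a, d):
    """Recursion (9.3.15) with e(0) = 0 and zero samples before n = 0."""
    P, Q = len(a), len(d)
    e = [0.0] * len(x)
    for n in range(1, len(x)):
        acc = x[n]
        acc += sum(a[k - 1] * x[n - k] for k in range(1, P + 1) if n - k >= 0)
        acc -= sum(d[k - 1] * e[n - k] for k in range(1, Q + 1) if n - k >= 0)
        e[n] = acc
    return e

x = [0.7, -1.3, 2.1, 0.4]      # arbitrary signal samples
a1, d1 = 0.5, -0.8             # arbitrary first-order parameters
e = prediction_error(x, [a1], [d1])

# Closed-form expansions for n = 2 and n = 3 from the first-order example.
e2_closed = x[2] + (a1 - d1) * x[1] - a1 * d1 * x[0]
e3_closed = x[3] + (a1 - d1) * x[2] - (a1 - d1) * d1 * x[1] + a1 * d1**2 * x[0]
print(e[2], e2_closed)
print(e[3], e3_closed)
```

The products of parameters in the closed forms (for example $\hat{a}_1\hat{d}_1^2$) are what make the total squared error nonquadratic in the parameters.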


9.3.3 Nonlinear Least-Squares Optimization 
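We next outline such a technique that is based on the method of Gauss-Newton. More details
can be found in Scales (1985); Luenberger (1984); and Gill, Murray, and Wright (1981).
To this end, we expand the function $\mathcal{E}(\mathbf{c})$ as a Taylor series

$$\mathcal{E}(\mathbf{c}_0 + \Delta\mathbf{c}) = \mathcal{E}(\mathbf{c}_0) + (\Delta\mathbf{c})^T \nabla\mathcal{E}(\mathbf{c}_0) + \tfrac{1}{2}(\Delta\mathbf{c})^T [\nabla^2\mathcal{E}(\mathbf{c}_0)](\Delta\mathbf{c}) + \cdots \tag{9.3.17}$$

where

$$\nabla\mathcal{E}(\mathbf{c}) = \left[\frac{\partial\mathcal{E}}{\partial c_1}\ \ \frac{\partial\mathcal{E}}{\partial c_2}\ \ \cdots\ \ \frac{\partial\mathcal{E}}{\partial c_{P+Q}}\right]^T \tag{9.3.18}$$

is the vector of the first partial derivatives or gradient vector and $\nabla^2\mathcal{E}(\mathbf{c})$, whose (i, j)th
element is $\partial^2\mathcal{E}/(\partial c_i\,\partial c_j)$, is the (symmetric) matrix of second partial derivatives (Hessian
matrix).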

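The Taylor expansion of a quadratic function has only the first three terms. Indeed, for
the known excitation case we have

$$\nabla\mathcal{E}(\mathbf{c}) = -2\sum_{n=N_i}^{N_f} \mathbf{z}(n-1)\,e(n) = 2(\hat{\mathbf{R}}_z\mathbf{c} - \hat{\mathbf{r}}_z) \tag{9.3.19}$$

and

$$\nabla^2\mathcal{E}(\mathbf{c}) = 2\sum_{n=N_i}^{N_f} \mathbf{z}(n-1)\,\mathbf{z}^T(n-1) = 2\hat{\mathbf{R}}_z \tag{9.3.20}$$

Higher-order terms are zero, and if $\mathbf{c}_0$ is the minimum, then $\nabla\mathcal{E}(\mathbf{c}_0) = \mathbf{0}$. In this case,
(9.3.17) becomes

$$\mathcal{E}(\mathbf{c}_0 + \Delta\mathbf{c}) = \mathcal{E}(\mathbf{c}_0) + (\Delta\mathbf{c})^T \hat{\mathbf{R}}_z (\Delta\mathbf{c})$$

which shows that if $\hat{\mathbf{R}}_z$ is positive definite, that is, $(\Delta\mathbf{c})^T \hat{\mathbf{R}}_z (\Delta\mathbf{c}) > 0$ for every $\Delta\mathbf{c} \ne \mathbf{0}$, then any deviation
from the minimum results in an increase in the total squared error.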

This relationship holds approximately for nonquadratic functions, as long as $\mathbf{c}_0$ is close
to a minimum. Thus, if we are at a point $\mathbf{c}_i$ with total squared error $\mathcal{E}(\mathbf{c}_i)$, we can move to a
point $\mathbf{c}_{i+1}$ with total squared error $\mathcal{E}(\mathbf{c}_{i+1}) < \mathcal{E}(\mathbf{c}_i)$ by moving in the direction of $-\nabla\mathcal{E}(\mathbf{c}_i)$.
This suggests the following iterative procedure

$$\mathbf{c}_{i+1} = \mathbf{c}_i - \mu_i \mathbf{G}_i \nabla\mathcal{E}(\mathbf{c}_i) \tag{9.3.21}$$

where the positive scalar $\mu_i$ controls the length of the descent and matrix $\mathbf{G}_i$ modifies
the direction of the descent, as is specified by the gradient vector. Various choices for
these quantities lead to various optimization algorithms. For quadratic functions, choosing
$\mathbf{c}_0 = \mathbf{0}$, $\mu_0 = 1$, and $\mathbf{G}_0 = (2\hat{\mathbf{R}}_z)^{-1}$ (inverse of the Hessian matrix) gives $\mathbf{c}_1 = \hat{\mathbf{R}}_z^{-1}\hat{\mathbf{r}}_z$; that
is, we find the unique minimum in one step. This provides the motivation for modifying
the direction of the gradient using the inverse of the Hessian matrix, even for nonquadratic
functions. This choice is justified as long as we are close to a minimum.
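Using (9.3.13), we compute the Hessian as follows

$$\nabla^2\mathcal{E}(\mathbf{c}) = \nabla[\nabla\mathcal{E}(\mathbf{c})]^T = 2\sum_{n=N_i}^{N_f} \boldsymbol{\psi}(n)\boldsymbol{\psi}^T(n) + 2\sum_{n=N_i}^{N_f} [\nabla\boldsymbol{\psi}^T(n)]\,e(n) \tag{9.3.22}$$

where

$$\boldsymbol{\psi}(n) \triangleq \nabla e(n) = \left[\frac{\partial e(n)}{\partial \hat{a}_1}\ \cdots\ \frac{\partial e(n)}{\partial \hat{a}_P}\ \ \frac{\partial e(n)}{\partial \hat{d}_1}\ \cdots\ \frac{\partial e(n)}{\partial \hat{d}_Q}\right]^T \tag{9.3.23}$$

We usually approximate the Hessian with the first summation in (9.3.22), that is,

$$\mathbf{H} = 2\sum_{n=N_i}^{N_f} \boldsymbol{\psi}(n)\boldsymbol{\psi}^T(n) \tag{9.3.24}$$

Similarly, the gradient is given by

$$\nabla\mathcal{E}(\mathbf{c}) \triangleq \mathbf{v} = 2\sum_{n=N_i}^{N_f} \boldsymbol{\psi}(n)\,e(n) \tag{9.3.25}$$

If we set $\mathbf{G} = \mathbf{H}^{-1}$, the direction vector $\mathbf{g} = \mathbf{G}\mathbf{v} = \mathbf{H}^{-1}\mathbf{v}$ can be obtained by solving the
following linear system of equations:

$$\mathbf{H}\mathbf{g} = \mathbf{v} \tag{9.3.26}$$

Clearly, the factor 2 in the definitions of $\mathbf{H}$ and $\mathbf{v}$ does not affect the solution $\mathbf{g}$, and can
be dropped. Although the matrix $\mathbf{H}$ is guaranteed by (9.3.24) to be positive semidefinite, in
practice it may be singular or close to singular. To avoid such problems in solving (9.3.26),
we regularize the matrix by adding a small positive constant $\delta$ to its diagonal; that is, we
approximate the Hessian by $\mathbf{H} + \delta\mathbf{I}$, where $\mathbf{I}$ is the identity matrix. This approach is known
as the Levenberg-Marquardt regularization (Dennis and Schnabel 1983; Ljung 1987).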

We next compute the gradient $\boldsymbol{\psi}(n) = \nabla e(n)$, using (9.3.23) and (9.3.15). Indeed, we
have

$$\frac{\partial e(n)}{\partial \hat{a}_j} = x(n-j) - \sum_{k=1}^{Q} \hat{d}_k \frac{\partial e(n-k)}{\partial \hat{a}_j} \qquad j = 1, 2, \ldots, P \tag{9.3.27}$$

and

$$\frac{\partial e(n)}{\partial \hat{d}_j} = -e(n-j) - \sum_{k=1}^{Q} \hat{d}_k \frac{\partial e(n-k)}{\partial \hat{d}_j} \qquad j = 1, 2, \ldots, Q \tag{9.3.28}$$

Thus, the components of the gradient vector are obtained by driving the all-pole filter

$$H_\psi(z) = \frac{1}{\hat{D}(z)} = \frac{1}{1 + \sum_{k=1}^{Q} \hat{d}_k z^{-k}} \tag{9.3.29}$$

with the signals x(n) and −e(n), respectively. This filter is stable if the estimated model is
minimum-phase.

FIGURE 9.13
Illustration of the capability of a PZ(4, 2) and AP(10) model to
estimate the PSD of an ARMA(4, 2) process from a 300-sample
segment.

The above development leads to the following iterative algorithm, implemented in the
MATLAB function armals.m, which computes the parameters of a PZ(P, Q) model from
the data x(0), x(1), ..., x(N − 1) by minimizing the LS error. The LS pole-zero modeling
algorithm consists of the following steps:

1. Fit an AP(P + Q) model to the data, using the no-windowing LS method, and compute
the prediction error e(n) (see Section 9.2).

2. Fit a PZ(P, Q) model to the data {x(n), e(n), 0 ≤ n ≤ N − 1}, using the known
excitation method. Convert the model to minimum-phase, if necessary. Use Equations
(9.3.9) to (9.3.11).

3. Start the iterative minimization procedure, which involves the following steps:

   a. Compute the gradient $\boldsymbol{\psi}(n)$, using (9.3.27) and (9.3.28).

   b. Compute the Hessian $\mathbf{H}$ and the gradient $\mathbf{v}$, using (9.3.24) and (9.3.25).

   c. Solve (9.3.26) to compute the search vector $\mathbf{g}$. If necessary, use the Levenberg-Marquardt
   regularization technique.

   d. For $\mu = 1$ and successively smaller step sizes, compute $\mathbf{c} \leftarrow \mathbf{c} - \mu\mathbf{g}$ (cf. (9.3.21)),
   convert the model to minimum-phase, if necessary, and compute the corresponding value of
   $\mathcal{E}(\mathbf{c})$. Choose the value of $\mathbf{c}$ that gives the smaller total squared error.

   e. Stop if $\mathcal{E}(\mathbf{c})$ does not change significantly or if a certain number of iterations has
   been exceeded.



4. Compute the estimate of the input variance, using (9.3.15) and (9.3.12). 
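
Step 3 of the algorithm above can be sketched as follows. This is a simplified illustration and not the armals.m function: it starts from a user-supplied initial guess instead of steps 1 and 2, it omits the conversion to minimum phase (a candidate that does not lower the error is simply rejected), it halves the step size a fixed number of times, and it uses Gaussian elimination for (9.3.26). All names, the value of `delta`, and the simulated PZ(1, 1) data are assumptions.

```python
import random

def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][c] * y[c] for c in range(i + 1, n))
        y[i] = s / aug[i][i]
    return y

def error_and_gradient(x, a, d):
    """e(n) by (9.3.15) and psi(n) by (9.3.27)-(9.3.28), zero initial conditions."""
    P, Q, N = len(a), len(d), len(x)
    e = [0.0] * N
    psi = [[0.0] * (P + Q) for _ in range(N)]
    for n in range(N):
        e[n] = (x[n]
                + sum(a[k - 1] * x[n - k] for k in range(1, P + 1) if n >= k)
                - sum(d[k - 1] * e[n - k] for k in range(1, Q + 1) if n >= k))
        for j in range(P + Q):
            if j < P:                      # de/da: drive 1/D(z) with x(n - lag)
                lag = j + 1
                drive = x[n - lag] if n >= lag else 0.0
            else:                          # de/dd: drive 1/D(z) with -e(n - lag)
                lag = j - P + 1
                drive = -e[n - lag] if n >= lag else 0.0
            psi[n][j] = drive - sum(d[k - 1] * psi[n - k][j]
                                    for k in range(1, Q + 1) if n >= k)
    return e, psi

def total_error(x, a, d):
    """Total squared error (9.3.16) with residual windowing."""
    e, _ = error_and_gradient(x, a, d)
    return sum(v * v for v in e[max(len(a), len(d)):])

def gauss_newton_step(x, a, d, delta=1e-6, n_steps=8):
    """One pass through steps 3a-3d; returns (error, a, d) of the best candidate."""
    P, dim, n_i = len(a), len(a) + len(d), max(len(a), len(d))
    e, psi = error_and_gradient(x, a, d)
    rows = range(n_i, len(x))
    H = [[sum(psi[n][i] * psi[n][j] for n in rows) + (delta if i == j else 0.0)
          for j in range(dim)] for i in range(dim)]       # (9.3.24) plus delta*I
    v = [sum(psi[n][i] * e[n] for n in rows) for i in range(dim)]   # (9.3.25)
    g = solve(H, v)                        # (9.3.26); common factor 2 dropped
    best = (sum(e[n] * e[n] for n in rows), a, d)
    mu = 1.0
    for _ in range(n_steps):               # try mu = 1, 1/2, 1/4, ...
        c = [ci - mu * gi for ci, gi in zip(a + d, g)]
        E_new = total_error(x, c[:P], c[P:])
        if E_new < best[0]:
            best = (E_new, c[:P], c[P:])
        mu *= 0.5
    return best

# Simulated PZ(1, 1) data: x(n) = 0.8 x(n-1) + w(n) + 0.5 w(n-1) (made-up values).
random.seed(1)
N = 300
w = [random.gauss(0.0, 1.0) for _ in range(N)]
x = [0.0] * N
for n in range(N):
    x[n] = w[n] + (0.8 * x[n - 1] + 0.5 * w[n - 1] if n >= 1 else 0.0)

a, d = [-0.5], [0.0]                       # rough initial guess
history = [total_error(x, a, d)]
for _ in range(15):
    E, a, d = gauss_newton_step(x, a, d)
    history.append(E)
print(a, d, history[-1] / (N - 1))         # last value estimates the input variance
```

By construction each iteration keeps the current parameters unless a candidate lowers the total squared error, so the recorded error sequence never increases. Whether the iteration reaches the global minimum still depends on the starting point, which is why the algorithm in the text initializes from steps 1 and 2.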


The application of the LS PZ(P, Q) model estimation algorithm is illustrated in Figure 9.13,
which shows the actual PSD of a PZ(4, 2) model and the estimated PSDs, using an LS PZ(4, 2)
and an AP(10) model fitted to a 300-sample segment of the output process. We notice
that, in contrast to the PZ model, the AP model does not provide a good match at the spectral
zero. More details are provided in Problem 9.15.

† This approach was suggested in Ljung (1987), problem 10S.1.


9.4 APPLICATIONS 


Pole-zero modeling has many applications in such fields as spectral estimation, speech 
processing, geophysics, biomedical signal processing, and general time series analysis and 
forecasting (Marple 1987; Kay 1988; Robinson and Treitel 1980; Box, Jenkins, and Reinsel 
1994). In this section, we discuss the application of pole-zero models to spectral estimation 
and speech processing. 


9.4.1 Spectral Estimation 


After we have estimated the parameters of a PZ model, we can compute the PSD of the
analyzed process by

$$\hat{R}(e^{j\omega}) = \hat{\sigma}_w^2\, \frac{\left|1 + \sum_{k=1}^{Q} \hat{d}_k e^{-j\omega k}\right|^2}{\left|1 + \sum_{k=1}^{P} \hat{a}_k e^{-j\omega k}\right|^2} \tag{9.4.1}$$

In practice, we mainly use AP models because (1) the all-zero PSD estimator is essentially
identical to the Blackman-Tukey one (see Problem 9.16) and (2) the application of pole-zero
PSD estimators is limited by computational and other practical difficulties. Also, any
continuous PSD can be approximated arbitrarily well by the PSD of an AP(P) model if P
is chosen large enough (Anderson 1971). However, in practice, the value of P is limited
by the amount of available data (usually P < N/3). The statistical properties of all-pole
PSD estimators are difficult to obtain; however, it has been shown that the estimator is
consistent only if the analyzed process is AR($P_0$) with $P_0 \le P$. Furthermore, the quality of
the estimator degrades if the process is contaminated by noise. More details about pole-zero
PSD estimation can be found in Kay (1988), Porat (1994), and Percival and Walden (1993).

The performance of all-pole PSD estimators depends on the method used to estimate
the model parameters, the order of the model, and the presence of noise. The effect of
model mismatch is shown in Figure 9.13 and is further investigated in Problem 9.17. Order
selection in all-pole PSD estimation is absolutely critical: If P is too large, the obtained
PSD exhibits spurious peaks; if P is too small, the structure of the PSD is smoothed over.
The increased resolution of the parametric techniques, compared to the nonparametric PSD
estimation methods, is basically the result of imposing structure on the data (i.e., a model).
The model makes possible the extrapolation of the ACS, which in turn leads to better
resolution. However, if the adopted model is inaccurate, that is, if it does not match the
data, then the “gained” resolution reflects the model and not the data! As a result, despite
their popularity and their “success” with simulated signals, the application of parametric
PSD estimation techniques to actual experimental data is rather limited.

Figure 9.14 shows the results of a Monte Carlo simulation of various all-pole PSD
estimation techniques. We see that, except for the windowing approach that results in a
significant loss of resolution, all other techniques have similar performance. However, we
should mention that the forward/backward LS all-pole modeling method is considered to
provide the best results (Marple 1987).

FIGURE 9.14


Monte Carlo simulation for the comparison of all-pole PSD estimation techniques, using 50 
realizations of a 50-sample segment from an AR(4) process using fourth-order AP models. 
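
Equation (9.4.1) is straightforward to evaluate on a frequency grid. The sketch below assumes real-valued parameter estimates; the function name and the numerical values of `a_hat`, `d_hat`, and `var_hat` are hypothetical and serve only to exercise the formula.

```python
import cmath
import math

def pz_psd(omega, a, d, var_w):
    """Evaluate (9.4.1): var_w * |D(e^{jw})|^2 / |A(e^{jw})|^2."""
    A = 1.0 + sum(ak * cmath.exp(-1j * omega * k) for k, ak in enumerate(a, start=1))
    D = 1.0 + sum(dk * cmath.exp(-1j * omega * k) for k, dk in enumerate(d, start=1))
    return var_w * abs(D) ** 2 / abs(A) ** 2

a_hat = [-1.2, 0.6]     # hypothetical AP part (stable: pole radius sqrt(0.6))
d_hat = [1.0]           # hypothetical AZ part with a zero at omega = pi
var_hat = 2.0           # hypothetical input variance estimate

grid = [math.pi * k / 8 for k in range(9)]
psd = [pz_psd(wk, a_hat, d_hat, var_hat) for wk in grid]
# Floor the value before taking the logarithm so an exact spectral zero is safe.
psd_db = [10.0 * math.log10(max(p, 1e-300)) for p in psd]
print(psd_db)
```

Setting `d = []` gives the all-pole estimator and `a = []` the all-zero estimator, so the same routine covers the AP, AZ, and PZ cases discussed above.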


In practice, it is our experience that the best way to estimate the PSD of an actual signal
is to combine parametric prewhitening with nonparametric PSD estimation methods. The
process is illustrated in Figure 9.15 and involves the following steps:

1. Fit an AP(P) model to the data using the forward LS, forward/backward LS, or Burg’s
method with no windowing.

2. Compute the residual (prediction error)

$$e(n) = x(n) + \sum_{k=1}^{P} a_k^{*}\, x(n-k) \qquad P \le n \le N-1 \tag{9.4.2}$$

and then compute and plot its ACS, PACS, and cumulative periodogram (see Figure 9.2)
to see if it is reasonably white. The goal is not to completely whiten the residual but
to reduce its spectral dynamic range, that is, to increase its spectral flatness to avoid
spectral leakage.

3. Compute the PSD $\hat{R}_e(e^{j\omega_k})$, using one of the nonparametric techniques discussed in
Chapter 5.

4. Compute the PSD of x(n) by

$$\hat{R}_x(e^{j\omega_k}) = \frac{\hat{R}_e(e^{j\omega_k})}{|\hat{A}(e^{j\omega_k})|^2} \tag{9.4.3}$$

that is, by applying postcoloring to “undo” the prewhitening.


FIGURE 9.15
Block diagram of nonparametric PSD estimation using linear prediction
prewhitening.

The main goal of AP modeling here is to reduce the spectral dynamic range to avoid
leakage. In other words, we need a good linear predictor regardless of whether the process is
truly AR(P). Therefore, very accurate order selection and model fit are not critical, because
all spectral structure not captured by the model is still in the residuals. Needless to say, if
the periodogram of x(n) has a small dynamic range, we do not need prewhitening. Another
interesting application of prewhitening is for the detection of outliers in practical data
(Martin and Thomson 1982).


EXAMPLE 9.4.1. To illustrate the effectiveness of the above prewhitening and postcoloring
method, consider the AR(4) process x(n) used in Example 9.2.3. This process has a large dynamic
range, and hence nonparametric methods such as Welch’s periodogram averaging method
will suffer from leakage problems. Using the system function of the model

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - 2.7607z^{-1} + 3.8106z^{-2} - 2.6535z^{-3} + 0.9238z^{-4}}$$

and WGN(0, 1) input sequence, we generated 256 samples of x(n). These samples were then
used to obtain the all-pole LS predictor coefficients using the arwin function. The spectrum
$|\hat{A}(e^{j\omega})|^{-2}$ corresponding to this estimated model is shown in Figure 9.16 as a dashed curve. The
signal samples were prewhitened using the model to obtain the residuals e(n). The nonparametric
PSD estimate $\hat{R}_e(e^{j\omega})$ of e(n) was computed by using Welch’s method with L = 64 and 50
percent overlap. Finally, $\hat{R}_e(e^{j\omega})$ was postcolored using the spectrum $|\hat{A}(e^{j\omega})|^{-2}$ to obtain
$\hat{R}_x(e^{j\omega})$, which is shown in Figure 9.16 as a solid line. For comparison purposes, the Welch
PSD estimate of x(n) is also shown as a dotted line. As expected, the nonparametric estimate
does not resolve the two peaks in the true spectrum and suffers from leakage at high frequencies.
However, the combined nonparametric and parametric estimate resolves the two peaks with ease
and also follows the true spectrum quite well. Therefore, the use of the parametric method as a
preprocessor is highly recommended, especially in large-dynamic-range situations.

FIGURE 9.16
Spectral estimation of AR(4) process using prewhitening and postcoloring method
(panel title: PSD estimation of AR(4) signal; vertical axis: power in dB; curves:
prewhiten/postcolor, AR(4) PSD, Welch PSD).
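
The prewhitening and postcoloring steps can be sketched as follows. Several simplifications are assumed here and are not part of the example: the true A(z) of the example is used in place of an estimated predictor (so no arwin-style fit is performed), a plain periodogram stands in for Welch's method, and the function names and random seed are illustrative.

```python
import cmath
import math
import random

A_COEFFS = [1.0, -2.7607, 3.8106, -2.6535, 0.9238]   # A(z) from Example 9.4.1

def synthesize(w, a):
    """x(n) = w(n) - sum_{k>=1} a_k x(n-k): excite 1/A(z) with w(n)."""
    x = []
    for n, wn in enumerate(w):
        x.append(wn - sum(a[k] * x[n - k] for k in range(1, len(a)) if n >= k))
    return x

def prewhiten(x, a):
    """Residual (9.4.2) for real data: e(n) = sum_k a_k x(n-k), P <= n <= N-1."""
    P = len(a) - 1
    return [sum(a[k] * x[n - k] for k in range(P + 1)) for n in range(P, len(x))]

def periodogram(e, omega):
    """Plain periodogram; a stand-in for any Chapter 5 estimator."""
    E = sum(en * cmath.exp(-1j * omega * n) for n, en in enumerate(e))
    return abs(E) ** 2 / len(e)

def postcolor(R_e, a, omega):
    """(9.4.3): divide the residual PSD by |A(e^{jw})|^2."""
    A = sum(ak * cmath.exp(-1j * omega * k) for k, ak in enumerate(a))
    return R_e / abs(A) ** 2

random.seed(2)
w = [random.gauss(0.0, 1.0) for _ in range(256)]
x = synthesize(w, A_COEFFS)
e = prewhiten(x, A_COEFFS)

grid = [math.pi * k / 64 for k in range(65)]
R_x = [postcolor(periodogram(e, wk), A_COEFFS, wk) for wk in grid]
print(max(R_x) / min(R_x))     # dynamic range restored by postcoloring
```

With the exact A(z) the residual reproduces the excitation for n ≥ P, which is the ideal case; with an estimated predictor the residual is only approximately white, and that is sufficient because the remaining structure is recovered by the nonparametric estimate.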


9.4.2 Speech Modeling 


All-pole modeling using LS linear prediction is widely employed in speech processing
applications because (1) it provides a good approximation to the vocal tract for voiced
sounds and adequate approximation for unvoiced and transient sounds, (2) it results in a good
separation between source (fine spectral structure) and vocal tract (spectral envelope), and
(3) it is analytically tractable and leads to efficient software and hardware implementations.
Figure 9.17 shows a typical AP modeling system, also known as the linear predictive 
coding (LPC) processor, that is used in speech synthesis, coding, and recognition applica- 
tions. The processor operates in a block processing mode; that is, it processes a frame of N 
samples and computes a vector of model parameters using the following basic steps: 


1. Preemphasis. The digitized speech signal is filtered by the high-pass filter

$$H_1(z) = 1 - \alpha z^{-1} \qquad 0.9 \le \alpha \le 1 \tag{9.4.4}$$

to reduce the dynamic range of the spectrum, that is, to flatten the spectral envelope,
and make subsequent processing less sensitive to numerical problems (Makhoul 1975a).
Usually $\alpha = 0.95$, which results in about a 32 dB boost in the spectrum at $\omega = \pi$ over
that at $\omega = 0$. The preemphasizer can be made adaptive by setting $\alpha = \rho(1)$, where
$\rho(l)$ is the normalized autocorrelation of the frame, which corresponds to a first-order
optimum prediction error filter.

2. Frame blocking. Here the preemphasized signal is blocked into frames of N samples with
successive frames overlapping by $N_0 \approx N/3$ samples. In speech recognition N = 300
with a sampling rate $F_s = 6.67$ kHz, which corresponds to 45-ms frames overlapping by
15 ms.

3. Windowing. Each frame is multiplied by an N-sample window (usually Hamming) to 
smooth the discontinuities at the beginning and the end of the frame. 

4. Autocorrelation computation. Here the LPC processor computes the first P + 1 values 
of the autocorrelation sequence. Usually, P = 8 in speech recognition and P = 12 in 
speech coding applications. The value of r(0) provides the energy of the frame, which 
is useful for speech detection. 

5. LPC analysis. In this step the processor uses the P + 1 autocorrelations to compute an
LPC parameter set for each speech frame. Depending on the required parameters, we
can use the algorithm of Levinson-Durbin or the algorithm of Schur. The most widely
used parameters are

   $a_m = a_m^{(P)}$    LPC coefficients
   $k_m$    PACS
   $g_m = \frac{1}{2}\log\frac{1-k_m}{1+k_m} = -\tanh^{-1} k_m$    log area ratio coefficients
   $c(m)$    cepstral coefficients
   $\omega_m$    line spectrum pairs

where $1 \le m \le P$, except for the cepstrum, which is computed up to about 3P/2.
The line spectrum pair parameters, which are pole angles of the singular filters, were
discussed in Section 2.5.8, and their application to speech processing is considered in
Furui (1989).

FIGURE 9.17
Block diagram of an AP modeling processor for speech coding and recognition.
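
The five processing steps above can be sketched end to end. This is a hedged illustration rather than a reference LPC front end: the synthetic test signal, the frame sizes in the demo, and all function names are assumptions. Sign and indexing conventions for the reflection coefficients differ between references; here $k_m = a_m^{(m)}$ is assumed, which may differ from the convention used elsewhere in the book.

```python
import math
import random

def preemphasis_boost_db(alpha):
    """Gain of 1 - alpha z^-1 at omega = pi relative to omega = 0, in dB."""
    return 20.0 * math.log10((1.0 + alpha) / (1.0 - alpha))

def preemphasize(s, alpha=0.95):
    """Step 1: y(n) = s(n) - alpha s(n-1)."""
    return [s[n] - alpha * s[n - 1] if n > 0 else s[0] for n in range(len(s))]

def frame_blocking(s, N, N0):
    """Step 2: frames of N samples, successive frames overlapping by N0."""
    return [s[i:i + N] for i in range(0, len(s) - N + 1, N - N0)]

def hamming(N):
    """Step 3: N-sample Hamming window."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def autocorrelation(frame, P):
    """Step 4: first P + 1 autocorrelation values; r[0] is the frame energy."""
    N = len(frame)
    return [sum(frame[n] * frame[n - l] for n in range(l, N)) for l in range(P + 1)]

def levinson_durbin(r, P):
    """Step 5: LPC coefficients a_1..a_P, reflection coefficients, final error."""
    a, refl, err = [], [], r[0]
    for m in range(1, P + 1):
        acc = r[m] + sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = -acc / err
        a = [a[i] + k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= (1.0 - k * k)
        refl.append(k)
    return a, refl, err

def log_area_ratio(k):
    """g = 0.5 * log((1 - k) / (1 + k)), which equals -atanh(k)."""
    return 0.5 * math.log((1.0 - k) / (1.0 + k))

# Synthetic stand-in for speech: noise driving a two-pole resonator (made up).
random.seed(4)
s = [0.0, 0.0]
for _ in range(700):
    s.append(1.3 * s[-1] - 0.8 * s[-2] + random.gauss(0.0, 1.0))

frames = frame_blocking(preemphasize(s[2:]), N=300, N0=100)
frame = [fn * wn for fn, wn in zip(frames[0], hamming(300))]
r = autocorrelation(frame, P=8)
a, refl, err = levinson_durbin(r, P=8)
lar = [log_area_ratio(km) for km in refl]
print(preemphasis_boost_db(0.95), a, lar)
```

For $\alpha = 0.95$ the boost evaluates to $20\log_{10}(1.95/0.05) \approx 31.8$ dB, consistent with the "about 32 dB" quoted in step 1.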


The log area ratio and the line spectrum pair coefficients have good quantization prop- 
erties and are used for speech coding (Rabiner and Schafer 1978; Furui 1989); the cepstral 
coefficients provide an excellent discriminant for speech and speaker recognition applica- 
tions (Rabiner and Juang 1993; Mammone et al. 1996). AP models are extensively used 
for the modeling of speech sounds. However, the AP model does not provide an accurate 
description of the speech spectral envelope when the speech production process resembles a 
PZ system (Atal and Schroeder 1978). This can happen when (1) the nasal tract is coupled to 
the main vocal tract through the velar opening, for example, during the generation of nasals 
and nasalized sounds, (2) the source of excitation is not at the glottis but is in the interior 
of the vocal tract (Flanagan 1972), and (3) the transmission or recording channel has zeros 
in its response. Although a zero can be approximated with arbitrary precision by a number 
of poles, this approximation is usually inefficient and leads to spectral distortion and other 
problems. These problems can be avoided by using pole-zero modeling, as illustrated in 
the following example. More details about pole-zero speech modeling can be found in Atal 
and Schroeder (1978). 

Figure 9.18(a) shows a Hamming-windowed segment from an artificial nasal speech
signal sampled at $F_s = 10$ kHz. According to acoustic theory, such sounds require both
poles and zeros in the vocal tract system function. Before the fitting of the model, the data are
passed through a preemphasis filter with $\alpha = 0.95$. Figure 9.18(b) shows the periodogram
of the speech segment, the spectrum of an AP(16) model using data windowing, and the
spectrum of a PZ(12, 6) model using the least-squares algorithm described in Section 9.3.3
(see Problem 9.18 for details). We see that the pole-zero model matches zeros (“valleys”)
in the periodogram of the data better than other models do.


9.5 MINIMUM-VARIANCE SPECTRUM ESTIMATION 


The spectral estimation methods discussed in Chapter 5 are based on the discrete
Fourier transform (DFT) and are data-independent; that is, the processing does not depend
on the actual values of the samples to be analyzed. Window functions can be employed to
cut down on sidelobe leakage, at the expense of resolution. These methods have, as a rule of
thumb, an approximate resolution of $\Delta f \approx 1/N$ cycles per sampling interval. Thus, for all
these methods, resolution performance is limited by the number of available data samples
N. This problem is only accentuated when the data must be subdivided into segments to
reduce the variance of the spectrum estimate by averaging periodograms. The effective
resolution is then on the order of 1/M, where M is the window length of the segments.
For many applications the amount of data available for spectrum estimation may be limited
either because the signal may only be considered stationary over limited intervals of time
or may only be collected over a short finite interval.

FIGURE 9.18
(a) Speech segment and (b) periodogram, spectrum of a data windowing-based AP(16)
model, and spectrum of a residual windowing-based PZ(12, 6) model.

Many times, it may be necessary to resolve spectral peaks that are spaced closer than 
the 1/M limit imposed by the amount of data available. All the DFT-based methods use a 
predetermined, fixed processing that is independent of the values of the data. However, there 
are methods, termed data-adaptive spectrum estimation (Lacoss 1971), that can exploit ac- 
tual characteristics of the data to offer significant improvements over the data-independent, 
DFT-based methods, particularly in the case of limited data samples. Minimum-variance 
spectral estimation is one such technique (Capon 1969). Like the methods from Chapter 5, 
the minimum-variance spectral estimator is nonparametric; that is, it does not assume an 
underlying model for the data. However, the spectral estimator adapts itself to the character- 
istics of the data in order to reject as much out-of-band energy, that is, leakage, as possible. 
In addition, minimum-variance spectral estimation provides improved resolution, better
than the $\Delta f \approx 1/N$ associated with the DFT-based methods. As a result, the minimum-variance
method is commonly referred to as a high-resolution spectral estimator. Note
that model-based data-adaptive methods, such as the LS all-pole method, also have high 
resolving capabilities when the model adequately represents the data. 


Theory 


We derive the minimum-variance spectral estimator by using a filter bank structure in
which each of the filters adapts its response to the data. Recall that the goal of a power
spectrum estimator is to determine the power content of a signal at a certain frequency. To
this end, we would like to measure $R(e^{j2\pi f})$ at the frequency of interest only and not have
our estimate influenced by energy present at other frequencies. Thus, we might interpret
spectral estimation as a methodology for determining the ideal, frequency-selective filter
for each frequency. Recall the filter bank interpretation of a power spectral estimator from
Chapter 5. This ideal filter for $f_k$ should pass energy within its bandwidth $\Delta f$ but reject all
other energy, that is,

$$|H_k(e^{j2\pi f})|^2 = \begin{cases} \dfrac{1}{\Delta f} & |f - f_k| \le \dfrac{\Delta f}{2} \\[1ex] 0 & \text{otherwise} \end{cases} \tag{9.5.1}$$

where the factor $\Delta f \approx 1/M$ accounts for the filter bandwidth.† Therefore, the filter does
not impart a gain across the bandwidth of the filter, and the output of the filter is a measure
of power in the frequency band around $f_k$. However, since such an ideal filter does not exist
in practice, we need to design one that passes energy at the center frequency while rejecting
as much out-of-band energy as possible.

A filter bank-based spectral estimator should have filters at all frequencies of interest.
The filters have equal spacing in frequency, spanning the fundamental frequency range
$-\frac{1}{2} \le f < \frac{1}{2}$. Let us denote the total number of frequencies by K and the center frequency
of the kth filter as

$$f_k = \frac{k-1}{K} - \frac{1}{2} \tag{9.5.2}$$

for k = 1, 2, ..., K. The output of the kth filter is the convolution of the signal x(n) with
the impulse response of the filter $h_k(n)$, which can also be expressed in vector form as

$$y_k(n) = h_k(n) * x(n) = \sum_{m=0}^{M-1} h_k(m)\,x(n-m) = \mathbf{c}_k^H \mathbf{x}(n) \tag{9.5.3}$$

where

$$\mathbf{c}_k = [h_k^{*}(0)\ \ h_k^{*}(1)\ \ \cdots\ \ h_k^{*}(M-1)]^T \tag{9.5.4}$$

is the impulse response of the kth filter, and

$$\mathbf{x}(n) = [x(n)\ \ x(n-1)\ \ \cdots\ \ x(n-M+1)]^T \tag{9.5.5}$$

is the input data vector. In addition, we define the frequency vector $\mathbf{v}(f)$ as a vector of
complex exponentials at frequency f within the time-window vector from (9.5.5)

$$\mathbf{v}(f) = [1\ \ e^{-j2\pi f}\ \ \cdots\ \ e^{-j2\pi f(M-1)}]^T \tag{9.5.6}$$

When the frequency vector $\mathbf{v}(f)$ is chosen as the filter weight vector in (9.5.4), then the
filter will pass signals at frequency f. Note that if we have $\mathbf{c}_k = \mathbf{v}(f_k)$, then the resulting
filter bank performs a DFT since $\mathbf{v}(f)$ is a column vector in the DFT matrix. Thus, all the
DFT-based methods, when interpreted using a filter bank structure, use a form of $\mathbf{v}(f)$,
possibly with a window, as filter weights. See Chapter 5 for the filter bank interpretation of
the DFT.

The output $y_k(n)$ of the kth filter should ideally give an estimate of the power spectrum
at $f_k$. The output power of the kth filter is

$$E\{|y_k(n)|^2\} = \mathbf{c}_k^H \mathbf{R}_x \mathbf{c}_k \tag{9.5.7}$$

where $\mathbf{R}_x = E\{\mathbf{x}(n)\mathbf{x}^H(n)\}$ is the correlation matrix of the input data vector from (9.5.5).
Since the ideal filter response from (9.5.1) cannot be realized, we instead constrain our filter
$\mathbf{c}_k$ to have a response at the center frequency $f_k$ of

$$|H_k(e^{j2\pi f_k})|^2 = |\mathbf{c}_k^H \mathbf{v}(f_k)|^2 = M \tag{9.5.8}$$

This constraint ensures that the center frequency of our bandpass filter is at the frequency
$f_k$. To eliminate as much out-of-band energy as possible, the filter is formulated as the filter
that minimizes its output power subject to the center frequency constraint in (9.5.8), that is,

$$\min_{\mathbf{c}_k}\ \mathbf{c}_k^H \mathbf{R}_x \mathbf{c}_k \quad \text{subject to} \quad \mathbf{c}_k^H \mathbf{v}(f_k) = \sqrt{M} \tag{9.5.9}$$

This constraint requires the filter to have a response of $\sqrt{M}$ to a frequency vector from (9.5.6)
at the frequency of interest while rejecting (minimizing) energy from all other frequencies.
The solution to this constrained optimization problem can be found via Lagrange multipliers
(see Appendix B and Problem 9.22) to be

$$\mathbf{c}_k = \frac{\sqrt{M}\,\mathbf{R}_x^{-1}\mathbf{v}(f_k)}{\mathbf{v}^H(f_k)\,\mathbf{R}_x^{-1}\,\mathbf{v}(f_k)} \tag{9.5.10}$$

By substituting (9.5.10) into (9.5.3), we obtain the output of the kth filter. The power of this
signal, from (9.5.7), is the minimum-variance spectral estimate

$$\hat{R}_M^{(mv)}(e^{j2\pi f_k}) = E\{|y_k(n)|^2\} = \frac{M}{\mathbf{v}^H(f_k)\,\mathbf{R}_x^{-1}\,\mathbf{v}(f_k)} \tag{9.5.11}$$

† A similar normalization was performed for all the DFT-based methods. Note that the same is not true of a sinusoidal
signal that has zero bandwidth. See Example 9.5.2.


where the subscript M denotes the length of the data vector used to compute the spectral
estimate. Note that in order to compute the minimum-variance spectral estimate, we need
to find the inverse of the correlation matrix, which is a Toeplitz matrix since x(n) is
stationary. Efficient techniques for computing the inverse of a Toeplitz matrix were discussed
in Chapter 7.
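
Equations (9.5.10) and (9.5.11) can be exercised on a known correlation matrix. The sketch below is illustrative: it solves $\mathbf{R}_x\mathbf{y} = \mathbf{v}$ by Gaussian elimination instead of the Toeplitz techniques of Chapter 7, and the two test matrices (white noise, and one complex exponential in white noise) as well as all names are assumptions chosen so that the result can be verified by hand.

```python
import cmath
import math

def solve(A, b):
    """Solve A y = b (complex entries allowed) by Gaussian elimination."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][c] * y[c] for c in range(i + 1, n))
        y[i] = s / aug[i][i]
    return y

def freq_vector(f, M):
    """Frequency vector (9.5.6) matching x(n) = [x(n) x(n-1) ... x(n-M+1)]^T."""
    return [cmath.exp(-2j * math.pi * f * m) for m in range(M)]

def mv_filter_and_power(R, f):
    """Return the filter (9.5.10) and the spectral estimate (9.5.11) at f."""
    M = len(R)
    v = freq_vector(f, M)
    y = solve(R, v)                                    # y = R^{-1} v
    denom = sum(vi.conjugate() * yi for vi, yi in zip(v, y)).real
    c = [math.sqrt(M) * yi / denom for yi in y]
    return c, M / denom

M = 4
# Case 1: white noise with variance 2, so R = 2 I and the estimate is flat.
R_white = [[2.0 if i == j else 0.0 for j in range(M)] for i in range(M)]
c_white, p_white = mv_filter_and_power(R_white, 0.2)

# Case 2: one complex exponential (power 10) at f0 = 0.1 in unit-variance noise.
v0 = freq_vector(0.1, M)
R_exp = [[(1.0 if i == j else 0.0) + 10.0 * v0[i] * v0[j].conjugate()
          for j in range(M)] for i in range(M)]
c_exp, p_exp = mv_filter_and_power(R_exp, 0.1)
gain = sum(ci.conjugate() * vi for ci, vi in zip(c_exp, v0))   # should be sqrt(M)
print(p_white, p_exp, gain)
```

For white noise the estimate equals the noise variance at every frequency. For the exponential, the estimate at $f_0$ works out to $\sigma_w^2 + M\cdot 10 = 41$, which illustrates the footnote's remark that the bandwidth normalization does not hold for a zero-bandwidth sinusoidal component.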


Implementation 

A spectral estimator attempts to determine the power of a random process as a function
of frequency based on a finite set of observations. Since the minimum-variance estimate of
the spectrum involves the correlation matrix of the input data vector, which is unknown in
practice, the correlation matrix must be estimated from the data. An estimate of the M × M
correlation matrix, known as the sample correlation matrix, is given by†

$$\hat{\mathbf{R}}_x = \frac{1}{N-M+1}\,\mathbf{X}^H\mathbf{X} \tag{9.5.12}$$

where

$$\mathbf{X}^H = [\mathbf{x}(M)\ \ \mathbf{x}(M+1)\ \ \cdots\ \ \mathbf{x}(N)] = \begin{bmatrix} x(M) & x(M+1) & \cdots & x(N) \\ x(M-1) & x(M) & \cdots & x(N-1) \\ \vdots & \vdots & \ddots & \vdots \\ x(1) & x(2) & \cdots & x(N-M+1) \end{bmatrix} \tag{9.5.13}$$

is the data matrix formed from x(n) for $1 \le n \le N$. Any of the other methods of
forming a data matrix discussed in Chapter 8 can also be employed. Note that the data
matrix in (9.5.13) does not produce a Toeplitz matrix $\hat{\mathbf{R}}_x$ in (9.5.12), though other methods
from Chapter 8 will produce a Toeplitz sample correlation matrix.

An estimate of the spectrum based on the sample correlation matrix is found by
substituting $\hat{\mathbf{R}}_x$ for the true correlation matrix $\mathbf{R}_x$ in (9.5.11). Note that, in practice, the sample
correlation matrix is not actually computed. The form of the sample correlation matrix
resembles the product of the data matrices in the least-squares (LS) problem that is addressed
in Chapter 8. Therefore, we might compute the upper triangular factor of the data matrix $\mathbf{X}$
by using one of the techniques discussed in Chapter 8, such as a QR factorization. Indeed,
if we compute the QR factorization made up of the orthonormal matrix $\mathbf{Q}_x$ and the upper
triangular factor $\mathcal{R}_x$

$$\mathbf{X} = \mathbf{Q}_x \mathcal{R}_x \tag{9.5.14}$$

then the minimum-variance spectrum estimator based on the sample correlation matrix is

$$\hat{R}_M^{(mv)}(e^{j2\pi f_k}) = \frac{1}{N-M+1}\;\frac{M}{\left\|\mathbf{v}^H(f_k)\,\mathcal{R}_x^{-1}\right\|^2} \tag{9.5.15}$$

† We have normalized by N − M + 1, the number of realizations of the time-window vector $\mathbf{x}(n)$ in the data matrix
$\mathbf{X}$. This normalization is necessary so that the output of the filter bank corresponds to an estimate of power.


Note that the conjugation of the upper triangular matrix comes about through the formulation 
of the data matrix in (9.5.13). 

We have not addressed the issue of choosing the filter length M. Ideally, M is chosen to 
be as large as possible in order to maximize the rejection of out-of-band energy. However, 
from a practical point of view, we must place a limit on the filter length. As the filter length 
increases, the size of the data matrix grows, which increases the amount of computation 
necessary. In addition, since we are inherently estimating the correlation matrix, reducing 
the variance of this estimator requires averaging over a set of realizations of the input data 
vector x(7). Thus, for a fixed data record size of N, we must balance the length of the time 
window M against the number of realizations of the input data vector N — M+ 1. 

As we will demonstrate in the following example, the minimum-variance spectrum estimator
provides a means of achieving high resolution, certainly better than the $\Delta f \approx 1/M$
limit of the DFT-based methods. High resolving capability essentially means that the 
minimum-variance spectrum estimator can better distinguish complex exponential signals 
closely spaced in frequency. This topic is explored further in Section 9.6. However, high 
resolution does not come without a cost. In practice, the spectrum cannot be estimated over 
a continuous frequency interval and must be computed at a finite set of discrete frequency 
points. Since the minimum-variance estimator is based on $\mathbf{R}_x^{-1}$, it is very sensitive to the exact
frequency points at which the spectrum is estimated. Therefore, the minimum-variance
spectrum needs to be computed at a very fine frequency spacing in order to accurately mea- 
sure the power of such a complex exponential. In some applications where computational 
cost is a concern, the DFT-based methods are probably preferred, as long as they provide 
the necessary resolution and sidelobe leakage is properly controlled. 


EXAMPLE 9.5.1. In this example, we explore the resolving capability of the minimum-variance 
spectrum estimator and compare its performance to that of a DFT-based method (Bartlett) and 
the all-pole method. Two closely spaced complex exponentials, both with an amplitude of $\sqrt{10}$,
at discrete-time frequencies of $f = 0.1$ and $f = 0.12$ are contained in noise with unit power
$\sigma_w^2 = 1$. We apply the spectrum estimators with time-window lengths (or order) $M = 16$, 32, 64,
and 128 to signals consisting of 500 time samples. The estimated spectra were then averaged over 
100 realizations. The resulting average spectrum estimates are shown in Figure 9.19. Note that 
the frequency spacing of the two complex exponentials is $\Delta f = 0.02$, suggesting a time-window
length of at least M = 50 to resolve them with a DFT-based method. The minimum-variance 
spectrum estimator, however, is able to resolve them at the M = 32 window length, for which 
they are clearly not distinguishable using the DFT-based method. On the other hand, the all-pole 
spectrum estimate is able to resolve the two complex exponentials even for as low an order as 
M = 16, for which the minimum-variance spectrum was not successful. In general, the superior 
resolving capability of the all-pole model over the minimum-variance spectrum estimator is 
due to an averaging effect that comes about through the nonparametric nature of the minimum- 
variance method. This subject is explored following the next example. Note that the estimated 
noise level is most accurately measured by the minimum-variance method in all cases. Recall 
that the signal amplitude was $\sqrt{10}$, yet the estimated power at the frequencies of the complex
exponentials increases as the window length M increases. In the filter bank interpretation of 
the minimum-variance spectrum estimator, the normalization assumed a constant signal power 
level across the bandwidth of the frequency-selective filter. However, the complex exponential 
is actually an impulse in frequency and has zero bandwidth. Therefore, the estimated power 
will grow with the length of the time window used for the spectrum estimator as a result of 
this bandwidth normalization. The gain imparted on a complex exponential signal is explored in 
Example 9.5.2. 


EXAMPLE 9.5.2. Consider the complex exponential signal with frequency $f_1$ contained in noise

$$x(n) = \alpha_1 e^{j2\pi f_1 n} + w(n)$$

where $\alpha_1 = |\alpha_1| e^{j\psi_1}$ is a complex number with constant amplitude $|\alpha_1|$ and random phase $\psi_1$
with uniform distribution over $[0, 2\pi]$. The correlation matrix of $x(n)$ is

$$\mathbf{R}_x = |\alpha_1|^2 \mathbf{v}(f_1)\mathbf{v}^H(f_1) + \sigma_w^2 \mathbf{I}$$

FIGURE 9.19
Comparison of the minimum-variance (solid line), all-pole (large dashed line), and Fourier-based
(small dashed line) spectrum estimators for different time window lengths M.

Using the matrix inversion lemma from Appendix B, we can write the inverse of the correlation
matrix as

$$\mathbf{R}_x^{-1} = \frac{1}{\sigma_w^2}\mathbf{I} - \frac{\dfrac{|\alpha_1|^2}{\sigma_w^4}\,\mathbf{v}(f_1)\mathbf{v}^H(f_1)}{1 + \dfrac{|\alpha_1|^2}{\sigma_w^2}\,\mathbf{v}^H(f_1)\mathbf{v}(f_1)} = \frac{1}{\sigma_w^2}\left[\mathbf{I} - \frac{|\alpha_1|^2\,\mathbf{v}(f_1)\mathbf{v}^H(f_1)}{\sigma_w^2 + M|\alpha_1|^2}\right]$$

Substituting this expression for the inverse of the correlation matrix into (9.5.11) for the minimum-
variance spectrum estimate, we have

$$\hat{R}_M^{(mv)}(e^{j2\pi f}) = \frac{M}{\mathbf{v}^H(f)\mathbf{R}_x^{-1}\mathbf{v}(f)} = \frac{\sigma_w^2}{1 - \dfrac{|\alpha_1|^2/M}{\sigma_w^2 + M|\alpha_1|^2}\,\left|\mathbf{v}^H(f)\mathbf{v}(f_1)\right|^2}$$

Recall that the squared norm of the frequency vector $\mathbf{v}(f)$ from (9.5.6) is $\mathbf{v}^H(f_1)\mathbf{v}(f_1) = M$. Therefore,
the minimum-variance power spectrum estimate at $f = f_1$ is

$$\hat{R}_M^{(mv)}(e^{j2\pi f_1}) = \sigma_w^2 + M|\alpha_1|^2$$

that is, the sum of the noise power and the signal power times the time-window length. This gain
of M on the signal power comes about through the normalization we imposed on our filter in
(9.5.8). This normalization assumed the signal had equal amplitude across the passband of the
filter. A complex exponential, on the other hand, has no bandwidth and thus this normalization
imparts a gain of M on the signal. Therefore, if an estimate of the amplitude of a complex
exponential is desired, this gain must be accounted for. Last, let us examine the behavior of the
minimum-variance spectrum estimator at the other frequencies that contain only noise. In the
case of $M \gg 1$, then $\mathbf{v}^H(f)\mathbf{v}(f_1) \approx 0$ for $f \ne f_1$ and

$$\hat{R}_M^{(mv)}(e^{j2\pi f}) \approx \sigma_w^2$$
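
The gain of M derived in this example can be checked numerically. The following MATLAB sketch builds the true correlation matrix of one complex exponential in white noise and evaluates (9.5.11) at the signal frequency and at a noise-only frequency. All numerical values and variable names are hypothetical choices for illustration.

```matlab
% Minimum-variance spectrum of one complex exponential in white noise,
% evaluated from the true correlation matrix (hypothetical values).
M = 16; f1 = 0.1; alpha1 = 2; sigw2 = 1;
k  = (0:M-1).';
v  = @(f) exp(1j*2*pi*f*k);                        % frequency vector v(f) of length M
Rx = abs(alpha1)^2*(v(f1)*v(f1)') + sigw2*eye(M);  % |alpha1|^2 v v^H + sigma_w^2 I
Rmv = @(f) M/real(v(f)'*(Rx\v(f)));                % M / (v^H Rx^{-1} v), as in (9.5.11)

P_on    = Rmv(f1);                   % at the signal frequency: sigw2 + M*|alpha1|^2
P_off   = Rmv(f1 + 0.25);            % v(f1+0.25) is orthogonal to v(f1) for M = 16: sigw2
amp_est = sqrt((P_on - sigw2)/M);    % undo the gain of M to recover |alpha1|
```

With these values `P_on` is 65, that is, 1 + 16·4, while `P_off` equals the noise power. Dividing the excess power by M before taking the square root recovers the amplitude, which is the correction the example calls for.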


Relationship between the minimum-variance and all-pole spectrum estimation 
methods 


The minimum-variance spectrum estimator has an interesting relation to the all-pole 
spectrum estimator discussed in Section 9.4. Recall from (9.5.11) that the minimum-variance 
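spectrum estimate is a function of $\mathbf{R}_x^{-1}$. The inverse of a Toeplitz correlation matrix was
studied in Chapter 7 and from (7.7.8) can be written as an $\mathbf{L}\mathbf{D}\mathbf{L}^H$ decomposition

$$\mathbf{R}_x^{-1} = \mathbf{A}^H \mathbf{D}^{-1} \mathbf{A} \tag{9.5.16}$$

where the upper triangular matrix $\mathbf{A}$ from (7.7.9) is given by

$$\mathbf{A} = \begin{bmatrix}
1 & a_1^{(M)*} & a_2^{(M)*} & \cdots & a_{M-1}^{(M)*} \\
0 & 1 & a_1^{(M-1)*} & \cdots & a_{M-2}^{(M-1)*} \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & a_1^{(2)*} \\
0 & 0 & \cdots & 0 & 1
\end{bmatrix} \tag{9.5.17}$$

and the diagonal matrix $\mathbf{D}$ is

$$\mathbf{D} = \mathrm{diag}\{P_M, P_{M-1}, \ldots, P_1\} \tag{9.5.18}$$

Recall from Chapter 7 that the columns of the lower triangular factor $\mathbf{L} = \mathbf{A}^H$ are the
coefficients of the forward linear predictors of orders $m = 1, 2, \ldots, M - 1$ for the signal $x(n)$
with correlation matrix $\mathbf{R}_x$. $P_m$ is the residual output power resulting from the application of
this mth-order forward linear predictor to the signal $x(n)$. In turn, the forward linear predictor
coefficients form the mth-order all-pole model. The model orders are found in descending
order as the column index increases. Let us denote the column vector of coefficients for the
mth-order all-pole model as

$$\mathbf{a}_m = \left[1 \;\; a_1^{(m)} \;\; a_2^{(m)} \;\; \cdots \;\; a_{m-1}^{(m)}\right]^T \tag{9.5.19}$$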


We can write the estimate of the spectrum derived from an mth-order all-pole model in 
vector notation as 


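$$\hat{R}_m^{(ap)}(e^{j2\pi f}) = \frac{P_m}{\left|\mathbf{v}_m^H(f)\mathbf{a}_m\right|^2} \tag{9.5.20}$$

where $\mathbf{v}_m(f)$ is the frequency vector from (9.5.6) of order $M = m$. Then we can substitute
(9.5.16) into the minimum-variance spectrum estimator from (9.5.11) to obtain

$$\hat{R}_M^{(mv)}(e^{j2\pi f}) = \frac{M}{\mathbf{v}_M^H(f)\mathbf{R}_x^{-1}\mathbf{v}_M(f)} = \frac{M}{\mathbf{v}_M^H(f)\mathbf{A}^H\mathbf{D}^{-1}\mathbf{A}\mathbf{v}_M(f)} \tag{9.5.21}$$

Therefore, we can write the following relationship between the reciprocals of the minimum-
variance and all-pole model spectrum estimators

$$\frac{1}{\hat{R}_M^{(mv)}(e^{j2\pi f})} = \sum_{m=1}^{M} \frac{\left|\mathbf{v}_m^H(f)\mathbf{a}_m\right|^2}{M P_m} = \frac{1}{M}\sum_{m=1}^{M} \frac{1}{\hat{R}_m^{(ap)}(e^{j2\pi f})} \tag{9.5.22}$$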


where the subscripts denote the order of the respective spectrum estimators. Thus, the
minimum-variance spectrum estimator for a filter of length M is formed by averaging, in the
reciprocal sense of (9.5.22), the spectrum estimates from all-pole models of orders 1 through M.
Note that the resolving capabilities of the all-pole model improve with increasing model order.
As a result, the resolution of the minimum-variance spectrum estimator must be worse than that
of the Mth-order all-pole model, as we observed in Example 9.5.1. On the other hand, this
averaging of all-pole model spectra indicates a lower variance for the minimum-variance
spectrum estimator.
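
The relationship (9.5.22) can be verified numerically for any positive definite Toeplitz correlation matrix. The MATLAB sketch below is a minimal check under assumptions of our own: the correlation sequence, the matrix size, and the test frequency are hypothetical, and each predictor $\mathbf{a}_m$ with its error power $P_m$ is obtained by solving the normal equations $\mathbf{R}_m\mathbf{a}_m = [P_m \; 0 \; \cdots \; 0]^T$ directly rather than with the Levinson recursion of Chapter 7.

```matlab
% Numerical check of (9.5.22) for a Toeplitz correlation matrix (sketch).
M = 6; f = 0.13;                           % hypothetical size and test frequency
r = 0.9.^(0:M-1); r(1) = r(1) + 0.5;       % assumed correlation sequence r(l)
R = toeplitz(r);                           % M x M real symmetric Toeplitz matrix
vM = exp(1j*2*pi*f*(0:M-1).');             % frequency vector v_M(f)

inv_mv = real(vM'*(R\vM))/M;               % reciprocal of the minimum-variance estimate

inv_ap = zeros(M,1);                       % reciprocals of the all-pole estimates, m = 1..M
for m = 1:M
    c  = R(1:m,1:m)\[1; zeros(m-1,1)];     % first column of inv(R_m)
    Pm = 1/c(1);                           % prediction error power P_m
    am = c*Pm;                             % a_m = [1 a_1 ... a_{m-1}]^T, so am(1) = 1
    inv_ap(m) = abs(vM(1:m)'*am)^2/Pm;     % |v_m^H a_m|^2 / P_m
end

err = abs(inv_mv - mean(inv_ap));          % difference of the two sides of (9.5.22)
```

The difference `err` is at round-off level, confirming that the reciprocal of the minimum-variance estimate is the average of the reciprocals of the all-pole estimates.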


9.6 HARMONIC MODELS AND FREQUENCY ESTIMATION TECHNIQUES 


The pole-zero models we have discussed so far assume a linear time-invariant system that is 
excited by white noise. However, in many applications, the signals of interest are complex 
exponentials contained in white noise for which a sinusoidal or harmonic model is more 
appropriate. Signals consisting of complex exponentials are found as formant frequencies 
in speech processing, moving targets in radar, and spatially propagating signals in array 
processing. For real signals, complex exponentials make up a complex conjugate pair 
(sinusoids), whereas for complex signals, they may occur at a single frequency. 

For complex exponentials found in noise, the parameters of interest are the frequencies 
of the signals. Therefore, our goal is to estimate these frequencies from the data. One might 
consider estimating the power spectrum by using the nonparametric methods discussed 
in Chapter 5 or the minimum-variance spectral estimate from Section 9.5. The frequency 
estimates of the complex exponentials are then the frequencies at which peaks occur in the 
spectrum. Certainly, the use of these nonparametric methods seems appropriate for com- 
plex exponential signals since they make no assumptions about the underlying process. We 
might also consider making use of an all-pole model for the purposes of spectrum estima- 
tion as discussed in Section 9.4.1, also known as the maximum entropy method (MEM) 
spectral estimation technique. Even though some of these methods can achieve very fine 
resolution, none of these methods accounts for the underlying model of complex exponen- 
tials in noise. As in all modeling problems, the use of the appropriate model is desirable 
from an intuitive point of view and advantageous in terms of performance. We begin by 
describing the harmonic signal model, deriving the model in a vector notation, and looking 
at the eigendecomposition of the correlation matrix of complex exponentials in noise. Then 
we describe frequency estimation methods based on the harmonic model: the Pisarenko 
harmonic decomposition, and the MUSIC, minimum-norm, and ESPRIT algorithms. 

These methods have the ability to resolve complex exponentials closely spaced in frequency,
which has led to the name superresolution commonly being associated with them.
However, a word of caution is in order on the use of these harmonic models. The high level of
performance in terms of resolution is achieved by assuming an underlying model of the data.
As with all other parametric methods, the performance of these techniques depends upon 
how closely this mathematical model matches the actual physical process that produced 
the signals. Deviations from this assumption result in model mismatch and will produce 
frequency estimates for a signal that may not have been produced by complex exponentials. 
In this case, the frequency estimates have little meaning. 


9.6.1 Harmonic Model 


Consider the signal model that consists of P complex exponentials in noise 


$$x(n) = \sum_{p=1}^{P} \alpha_p e^{j2\pi n f_p} + w(n) \tag{9.6.1}$$


†In array processing, a spatially propagating wave produces a complex exponential signal as measured across
uniformly spaced sensors in an array. The frequency of the complex exponential is determined by the angle of 
arrival of the impinging, spatially propagating signal. Thus, in array processing the frequency estimation problem is 
known as angle-of-arrival (AOA) or direction-of-arrival (DOA) estimation. This topic is discussed in Section 11.7. 


The normalized, discrete-time frequency of the pth component is

$$f_p = \frac{\omega_p}{2\pi} = \frac{F_p}{F_s} \tag{9.6.2}$$

where $\omega_p$ is the discrete-time frequency in radians, $F_p$ is the actual frequency of the pth
complex exponential, and $F_s$ is the sampling frequency. The complex exponentials may
occur either individually or in complex conjugate pairs, as in the case of real signals. In
general, we want to estimate the frequencies and possibly also the amplitudes of these
signals. Note that the phase of each complex exponential is contained in the amplitude, that
is,

$$\alpha_p = |\alpha_p| e^{j\psi_p} \tag{9.6.3}$$

where the phases $\psi_p$ are uncorrelated random variables uniformly distributed over $[0, 2\pi]$.
The magnitude $|\alpha_p|$ and the frequency $f_p$ are deterministic quantities. If we consider the
spectrum of a harmonic process, we note that it consists of a set of impulses with a constant
background level at the power of the white noise $\sigma_w^2 = E\{|w(n)|^2\}$. As a result, the power
spectrum of complex exponentials is commonly referred to as a line spectrum, as illustrated
in Figure 9.20.


FIGURE 9.20
The spectrum of complex exponentials in noise. (Vertical axis: $R_x(e^{j2\pi f})$; horizontal axis:
frequency; the constant background marks the noise level.)


Since we will make use of matrix methods based on a certain time window of length
M, it is useful to characterize the signal model in the form of a vector over this time window
consisting of the sample delays of the signal. Consider the signal $x(n)$ from (9.6.1) at its
current and future $M - 1$ values. This time window can be written as

$$\mathbf{x}(n) = [x(n) \;\; x(n+1) \;\; \cdots \;\; x(n+M-1)]^T \tag{9.6.4}$$

We can then write the signal model consisting of complex exponentials in noise from (9.6.1)
for a length-M time-window vector as

$$\mathbf{x}(n) = \sum_{p=1}^{P} \alpha_p \mathbf{v}(f_p) e^{j2\pi n f_p} + \mathbf{w}(n) = \mathbf{s}(n) + \mathbf{w}(n) \tag{9.6.5}$$

where $\mathbf{w}(n) = [w(n) \;\; w(n+1) \;\; \cdots \;\; w(n+M-1)]^T$ is the time-window vector of white
noise and

$$\mathbf{v}(f) = [1 \;\; e^{j2\pi f} \;\; \cdots \;\; e^{j2\pi (M-1) f}]^T \tag{9.6.6}$$

is the time-window frequency vector. Note that $\mathbf{v}(f)$ is simply a length-M DFT vector
at frequency $f$. We differentiate here between the signal $\mathbf{s}(n)$, consisting of the sum of
complex exponentials, and the noise component $\mathbf{w}(n)$, respectively.

Consider the time-window vector model consisting of a sum of complex exponentials 
in noise from (9.6.5). The autocorrelation matrix of this model can be written as the sum of 
signal and noise autocorrelation matrices 


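$$\mathbf{R}_x = E\{\mathbf{x}(n)\mathbf{x}^H(n)\} = \mathbf{R}_s + \mathbf{R}_w = \sum_{p=1}^{P} |\alpha_p|^2 \mathbf{v}(f_p)\mathbf{v}^H(f_p) + \sigma_w^2 \mathbf{I} = \mathbf{V}\mathbf{A}\mathbf{V}^H + \sigma_w^2 \mathbf{I} \tag{9.6.7}$$

where

$$\mathbf{V} = [\mathbf{v}(f_1) \;\; \mathbf{v}(f_2) \;\; \cdots \;\; \mathbf{v}(f_P)] \tag{9.6.8}$$

is an $M \times P$ matrix whose columns are the time-window frequency vectors from (9.6.6) at
frequencies $f_p$ of the complex exponentials and

$$\mathbf{A} = \begin{bmatrix}
|\alpha_1|^2 & 0 & \cdots & 0 \\
0 & |\alpha_2|^2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & |\alpha_P|^2
\end{bmatrix} \tag{9.6.9}$$

is a diagonal matrix of the powers of each of the respective complex exponentials. The
autocorrelation matrix of the white noise is

$$\mathbf{R}_w = \sigma_w^2 \mathbf{I} \tag{9.6.10}$$

which is full rank, as opposed to $\mathbf{R}_s$, which is rank-deficient for $P < M$. In general, we will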
always choose the length of our time window M to be greater than the number of complex 
exponentials P. 
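
The structure of (9.6.7) through (9.6.10) is easy to reproduce numerically. The MATLAB sketch below builds the signal and noise correlation matrices for a small case and confirms the rank statements; the frequencies, amplitudes, and sizes are hypothetical values chosen only for illustration.

```matlab
% Correlation matrix of P complex exponentials in white noise, (9.6.7)-(9.6.10).
M     = 8;                              % time-window length (hypothetical)
fp    = [0.1 0.2];                      % frequencies f_p (hypothetical)
alpha = [1 0.5];                        % amplitudes |alpha_p| (hypothetical)
sigw2 = 1;                              % white-noise power
P     = numel(fp);

V  = exp(1j*2*pi*(0:M-1).'*fp);         % M x P matrix [v(f_1) ... v(f_P)], (9.6.8)
A  = diag(abs(alpha).^2);               % P x P diagonal matrix of powers, (9.6.9)
Rs = V*A*V';                            % signal correlation matrix
Rw = sigw2*eye(M);                      % noise correlation matrix, (9.6.10)
Rx = Rs + Rw;                           % (9.6.7)

rank_Rs = rank(Rs);                     % P: rank-deficient because P < M
rank_Rx = rank(Rx);                     % M: full rank because of the noise term
```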

The autocorrelation matrix can also be written in terms of its eigendecomposition 


$$\mathbf{R}_x = \sum_{m=1}^{M} \lambda_m \mathbf{q}_m \mathbf{q}_m^H = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H \tag{9.6.11}$$

where $\lambda_m$ are the eigenvalues in descending order, that is, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_M$, and $\mathbf{q}_m$ are
their corresponding eigenvectors. Here $\boldsymbol{\Lambda}$ is a diagonal matrix made up of the eigenvalues
found in descending order on the diagonal, while the columns of $\mathbf{Q}$ are the corresponding
eigenvectors. The eigenvalues due to the signals can be written as the sum of the signal
power in the time window and the noise:

$$\lambda_m = M|\alpha_m|^2 + \sigma_w^2 \qquad \text{for } m \le P \tag{9.6.12}$$

The remaining eigenvalues are due to the noise only, that is,

$$\lambda_m = \sigma_w^2 \qquad \text{for } m > P \tag{9.6.13}$$

Therefore, the P largest eigenvalues correspond to the signal made up of complex expo- 
nentials and the remaining eigenvalues have equal value and correspond to the noise. Thus, 
we can partition the correlation matrix into portions due to the signal and noise eigen- 
vectors 


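$$\mathbf{R}_x = \sum_{m=1}^{P} \left(M|\alpha_m|^2 + \sigma_w^2\right)\mathbf{q}_m\mathbf{q}_m^H + \sum_{m=P+1}^{M} \sigma_w^2\, \mathbf{q}_m\mathbf{q}_m^H = \mathbf{Q}_s\boldsymbol{\Lambda}_s\mathbf{Q}_s^H + \sigma_w^2\mathbf{Q}_w\mathbf{Q}_w^H \tag{9.6.14}$$

where

$$\mathbf{Q}_s = [\mathbf{q}_1 \;\; \mathbf{q}_2 \;\; \cdots \;\; \mathbf{q}_P] \qquad \mathbf{Q}_w = [\mathbf{q}_{P+1} \;\; \cdots \;\; \mathbf{q}_M] \tag{9.6.15}$$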


are matrices whose columns consist of the signal and noise eigenvectors, respectively. The
matrix $\boldsymbol{\Lambda}_s$ is a $P \times P$ diagonal matrix containing the signal eigenvalues from (9.6.12). Thus,
the M-dimensional subspace that contains the observations of the time-window signal vector
from (9.6.5) can be split into two subspaces spanned by the signal and noise eigenvectors,
respectively. These two subspaces, known as the signal subspace and the noise subspace,
are orthogonal to each other since the correlation matrix is Hermitian symmetric.† All the
subspace methods discussed later in this section rely on the partitioning of the vector space
into signal and noise subspaces. Recall from Chapter 8 in (8.2.29) that the projection matrix
from an M-dimensional space onto an L-dimensional subspace ($L < M$) spanned by a set
of vectors $\mathbf{Z} = [\mathbf{z}_1 \;\; \mathbf{z}_2 \;\; \cdots \;\; \mathbf{z}_L]$ is

$$\mathbf{P} = \mathbf{Z}(\mathbf{Z}^H\mathbf{Z})^{-1}\mathbf{Z}^H \tag{9.6.16}$$

†The eigenvectors of a Hermitian symmetric matrix are orthogonal.


Therefore, we can write the matrices that project an arbitrary vector onto the signal and 
noise subspaces as 


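$$\mathbf{P}_s = \mathbf{Q}_s\mathbf{Q}_s^H \qquad \mathbf{P}_w = \mathbf{Q}_w\mathbf{Q}_w^H \tag{9.6.17}$$

since the eigenvectors of the correlation matrix are orthonormal ($\mathbf{Q}_s^H\mathbf{Q}_s = \mathbf{I}$ and $\mathbf{Q}_w^H\mathbf{Q}_w = \mathbf{I}$).
Since the two subspaces are orthogonal

$$\mathbf{P}_w\mathbf{Q}_s = \mathbf{0} \qquad \mathbf{P}_s\mathbf{Q}_w = \mathbf{0} \tag{9.6.18}$$

then all the time-window frequency vectors from (9.6.5) must lie completely in the signal
subspace, that is,

$$\mathbf{P}_s\mathbf{v}(f_p) = \mathbf{v}(f_p) \qquad \mathbf{P}_w\mathbf{v}(f_p) = \mathbf{0} \tag{9.6.19}$$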


These concepts are central to the subspace-based frequency estimation methods discussed 
in Sections 9.6.2 through 9.6.5. 
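
The partition into signal and noise subspaces can be checked on a small example. The MATLAB sketch below is an illustration with hypothetical values. The two frequencies are deliberately chosen as multiples of 1/M so that the frequency vectors are mutually orthogonal, in which case the signal eigenvalues take exactly the values in (9.6.12); for other frequency spacings the projection relations (9.6.19) still hold, but the individual signal eigenvalues generally differ from $M|\alpha_m|^2 + \sigma_w^2$.

```matlab
% Signal/noise subspace partition of the true correlation matrix (sketch).
M = 8; sigw2 = 1;
fp    = [1 3]/M;                        % multiples of 1/M: v(f_1) and v(f_2) are orthogonal
alpha = [2 1];                          % hypothetical amplitudes, strongest first
P     = numel(fp);
V  = exp(1j*2*pi*(0:M-1).'*fp);
Rx = V*diag(abs(alpha).^2)*V' + sigw2*eye(M);
Rx = (Rx + Rx')/2;                      % enforce exact Hermitian symmetry

[Q, D] = eig(Rx);
[lambda, idx] = sort(real(diag(D)), 'descend');
Q  = Q(:, idx);                         % eigenvectors ordered by descending eigenvalue
Qs = Q(:, 1:P);   Qw = Q(:, P+1:M);     % signal and noise eigenvectors, (9.6.15)
Ps = Qs*Qs';      Pw = Qw*Qw';          % projection matrices, (9.6.17)

lambda_expected = [M*abs(alpha(:)).^2 + sigw2; sigw2*ones(M-P,1)];  % (9.6.12), (9.6.13)
err_eig   = max(abs(lambda - lambda_expected));
err_sig   = norm(Ps*V - V);             % Ps*v(f_p) = v(f_p), (9.6.19)
err_noise = norm(Pw*V);                 % Pw*v(f_p) = 0,      (9.6.19)
```

All three error measures are at round-off level for this case.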

Note that in our analysis, we are considering the theoretical or true correlation matrix 
R,. In practice, the correlation matrix is not known and must be estimated from the measured 
data samples. If we have a time-window signal vector from (9.6.4), then we can form the 
data matrix by stacking the rows with measurements of the time-window data vector at a 
time n 


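$$\mathbf{X} = \begin{bmatrix}
\mathbf{x}^T(0) \\ \mathbf{x}^T(1) \\ \vdots \\ \mathbf{x}^T(n) \\ \vdots \\ \mathbf{x}^T(N-2) \\ \mathbf{x}^T(N-1)
\end{bmatrix} = \begin{bmatrix}
x(0) & x(1) & \cdots & x(M-1) \\
x(1) & x(2) & \cdots & x(M) \\
\vdots & \vdots & & \vdots \\
x(n) & x(n+1) & \cdots & x(n+M-1) \\
\vdots & \vdots & & \vdots \\
x(N-2) & x(N-1) & \cdots & x(N+M-3) \\
x(N-1) & x(N) & \cdots & x(N+M-2)
\end{bmatrix} \tag{9.6.20}$$

which has dimensions of $N \times M$, where N is the number of data records or snapshots and
M is the time-window length. From this matrix, we can form an estimate of the correlation
matrix, referred to as the sample correlation matrix

$$\hat{\mathbf{R}}_x = \frac{1}{N}\mathbf{X}^H\mathbf{X} \tag{9.6.21}$$

In the case of an estimated sample correlation matrix, the noise eigenvalues are no longer
equal because of the finite number of samples used to compute $\hat{\mathbf{R}}_x$. Therefore, the nice, clean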
threshold between signal and noise eigenvalues, as described in (9.6.12) and (9.6.13), no 
longer exists. The model order estimation techniques discussed in Section 9.2 can be em- 
ployed to attempt to determine the number of complex exponentials P present. In practice, 
these methods are best used as rough estimates, as their performance is not very accurate, 
especially for short data records. 
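
A short MATLAB sketch illustrates how the data matrix of (9.6.20) can be formed and how the noise eigenvalues of the resulting estimate spread out. It rests on assumptions of our own: the signal parameters are hypothetical, `hankel` is used to stack the time-window vectors, and the estimate is computed as the average of $\mathbf{x}(n)\mathbf{x}^H(n)$, written `X.'*conj(X)/N`, so that it lines up with the definition in (9.6.7). This is the complex conjugate of `X'*X/N`, and the two coincide for real data.

```matlab
% Sample correlation matrix from N snapshots and its eigenvalue spread (sketch).
N = 128; M = 8; sigw2 = 1;                     % hypothetical sizes
fp = [0.1 0.2]; P = numel(fp);
n  = (0:N+M-2).';                              % N snapshots of length M need N+M-1 samples
x  = sum(exp(1j*2*pi*n*fp), 2) ...
     + sqrt(sigw2/2)*(randn(size(n)) + 1j*randn(size(n)));

X    = hankel(x(1:N), x(N:N+M-1));             % N x M data matrix, row n is [x(n) ... x(n+M-1)]
Rhat = (X.'*conj(X))/N;                        % (1/N) * sum over n of x(n)*x(n)^H
Rhat = (Rhat + Rhat')/2;                       % remove round-off asymmetry

lambda       = sort(real(eig(Rhat)), 'descend');
lambda_noise = lambda(P+1:M);                  % would all equal sigw2 for the true Rx
spread       = max(lambda_noise) - min(lambda_noise);
```

For a finite record `spread` is strictly positive, so the boundary between signal and noise eigenvalues has to be judged rather than read off.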

For several of the frequency estimation techniques described in this section, the analysis
considers the use of eigenvalues and eigenvectors of the correlation matrix for the purposes
of defining signal and noise subspaces.† In practice, we estimate the signal and noise subspaces
by using the eigenvectors and eigenvalues of the sample correlation matrix. Note that
for notational expedience we will not differentiate between eigenvectors and eigenvalues of
the true and sample correlation matrices. However, the reader should always keep in mind
that the sample correlation matrix eigendecomposition is what must be used for implementation.
We note that use of an estimate rather than the true correlation matrix will result in
a degradation in performance, the analysis of which is beyond the scope of this book.

†The ESPRIT method uses a singular value decomposition of the data matrix $\mathbf{X}$.


9.6.2 Pisarenko Harmonic Decomposition 


The Pisarenko harmonic decomposition (PHD) was the first frequency estimation method 
proposed that was based on the eigendecomposition of the correlation matrix and its parti- 
tioning into signal and noise subspaces (Pisarenko 1973). This method uses the eigenvector 
associated with the smallest eigenvalue to estimate the frequencies of the complex expo- 
nentials. Although this method has limited practical use owing to its sensitivity to noise, 
it is of great theoretical interest because it was the first method based on signal and noise 
subspace principles and it helped to fuel the development of many well-known subspace 
methods, such as MUSIC and ESPRIT. 

Consider the model of complex exponentials contained in noise in (9.6.5) and the 
eigendecomposition of its correlation matrix in (9.6.14). The eigenvector corresponding to 
the minimum eigenvalue must be orthogonal to all the eigenvectors in the signal subspace. 
Thus, we choose the time window to be of length 


M=P+1 (9.6.22) 


that is, 1 greater than the number of complex exponentials. Therefore, the noise subspace 
consists of a single eigenvector 


$$\mathbf{Q}_w = \mathbf{q}_M \tag{9.6.23}$$

corresponding to the minimum eigenvalue $\lambda_M$. By virtue of the orthogonality between the
signal and noise subspaces, each of the P complex exponentials in the time-window signal
vector model in (9.6.5) is orthogonal to this eigenvector

$$\mathbf{v}^H(f_m)\mathbf{q}_M = \sum_{k=1}^{M} q_M(k)\, e^{-j2\pi f_m (k-1)} = 0 \qquad \text{for } m \le P \tag{9.6.24}$$

Making use of this property, we can compute

$$\bar{R}_{phd}(e^{j2\pi f}) = \frac{1}{\left|\mathbf{v}^H(f)\mathbf{q}_M\right|^2} = \frac{1}{\left|Q_M(e^{j2\pi f})\right|^2} \tag{9.6.25}$$

which is commonly referred to as a pseudospectrum. The frequencies are then estimated
by observing the P peaks in $\bar{R}_{phd}(e^{j2\pi f})$. Note that since (9.6.25) requires a search of all
frequencies $-0.5 < f < 0.5$, in practice a dense sampling of the frequencies is generally
necessary. The quantity

$$Q_M(e^{j2\pi f}) = \mathbf{v}^H(f)\mathbf{q}_M = \sum_{k=1}^{M} q_M(k)\, e^{-j2\pi f (k-1)} \tag{9.6.26}$$

is simply the Fourier transform of the Mth eigenvector corresponding to the minimum eigenvalue.
Thus, the pseudospectrum for the Pisarenko harmonic decomposition $\bar{R}_{phd}(e^{j2\pi f})$
can be efficiently implemented by computing the FFT of $\mathbf{q}_M$ with sufficient zero padding
to provide the necessary frequency resolution. Then $\bar{R}_{phd}(e^{j2\pi f})$ is simply the reciprocal
of the spectrum of the noise eigenvector, that is, of the squared magnitude of its Fourier
transform. Note that $\bar{R}_{phd}(e^{j2\pi f})$ is not an estimate of the true power spectrum since it contains
no information about the powers of the complex exponentials $|\alpha_p|^2$ or the background
noise level $\sigma_w^2$. However, these amplitudes can be found by using the estimated frequencies
and the corresponding time-window frequency vectors along with the relationship of
eigenvalues and eigenvectors. See Problem 9.24 for details.


Alternatively, the frequencies of the complex exponentials can be found by computing
the zeros of the Fourier transform of the Mth eigenvector in (9.6.23). The z-transform of
this eigenvector is

$$Q_M(z) = \sum_{k=1}^{M} q_M(k)\, z^{-(k-1)} = q_M(1)\prod_{k=1}^{M-1}\left(1 - e^{j2\pi f_k} z^{-1}\right) \tag{9.6.27}$$

where the phases $2\pi f_k$ of the $P = M - 1$ roots of this polynomial give the frequencies $f_k$ of the
$P = M - 1$ complex exponentials.
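
Both routes to the frequencies, rooting $Q_M(z)$ and searching the pseudospectrum, can be illustrated with a few lines of MATLAB. The sketch below is a minimal example under assumptions of our own: it uses the true correlation matrix of one real sinusoid (so $P = 2$ and $M = 3$) rather than an estimate, which makes the result exact, and the frequency and noise power are hypothetical values.

```matlab
% Pisarenko harmonic decomposition for one real sinusoid (P = 2, M = 3),
% using the true correlation matrix so that the result is exact (sketch).
f0 = 0.2; sigw2 = 1;                        % hypothetical frequency and noise power
P = 2; M = P + 1;
fp = [f0 -f0];                              % a real sinusoid is a complex conjugate pair
V  = exp(1j*2*pi*(0:M-1).'*fp);
Rx = V*diag([0.25 0.25])*V' + sigw2*eye(M); % unit-amplitude sinusoid: |alpha_p|^2 = 1/4
Rx = (Rx + Rx')/2;                          % enforce exact Hermitian symmetry

[Q, D] = eig(Rx);
[lambda_min, imin] = min(real(diag(D)));    % smallest eigenvalue (equals sigw2 here)
qM = Q(:, imin);                            % noise eigenvector q_M

% Zeros of Q_M(z) = qM(1) + qM(2) z^-1 + qM(3) z^-2 are the roots of
% qM(1) z^2 + qM(2) z + qM(3); their phases are 2*pi*f_p.
z    = roots(qM);
fhat = sort(angle(z)/(2*pi));               % estimated frequencies

% Equivalent grid search: peaks of 1/|FFT(qM)|^2 on a zero-padded grid.
Nfft  = 1024;
fgrid = (0:Nfft-1)/Nfft - 0.5;
Rbar  = 1./abs(fftshift(fft(qM, Nfft))).^2;
[~, ipk] = max(Rbar);                       % location of the largest peak
```

The rooted estimates `fhat` equal ±0.2 to machine precision, whereas the grid search is limited by the FFT spacing of 1/1024. With an estimated correlation matrix neither result is exact.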

As we stated at the outset, the significance of the Pisarenko harmonic decomposition is seen
mostly from a theoretical perspective. The limitations of its practical use stem from the fact 
that it uses a single noise eigenvector and, as a result, lacks the necessary robustness needed 
for most applications. Since the correlation matrix is not known and must be estimated from 
data, the resulting noise eigenvector of the estimated correlation matrix is only an estimate 
of the actual noise eigenvector. Because we only use one noise eigenvector, this method is 
very sensitive to any errors in the estimation of the noise eigenvector. 


EXAMPLE 9.6.1. We demonstrate the use of the Pisarenko harmonic decomposition with a sinusoid
in noise. The amplitude and frequency of the sinusoid are $\alpha = 1$ and $f = 0.2$, respectively.
The additive noise has unit power ($\sigma_w^2 = 1$). Using MATLAB, this signal is generated:


x = sin(2*pi*f*[0:(N-1)]') + (randn(N,1)+1j*randn(N,1))/sqrt(2);


Since the number of complex exponentials is equal to P = 2 (a complex conjugate pair 
for a sinusoid), the time-window length is chosen to be M = 3. After forming the N x M data 
matrix $\mathbf{X}$ and computing the sample correlation matrix $\hat{\mathbf{R}}_x$, we can compute the pseudospectrum
as follows: 


[Q0,D] = eig(R);                                   % eigendecomposition
[lambda,index] = sort(abs(diag(D)));               % eigenvalue magnitudes, ascending
lambda = lambda(M:-1:1); Q = Q0(:,index(M:-1:1));  % reorder to descending
Rbar = 1./abs(fftshift(fft(Q(:,M),Nfft))).^2;      % pseudospectrum of the noise eigenvector


Figure 9.21 shows the pseudospectrum of the Pisarenko harmonic decomposition for a
single realization with an FFT size of 1024. Note the two peaks near $f = \pm 0.2$. Recall that
this is a pseudospectrum, so that the actual values do not correspond to an estimate of power.
A MATLAB routine for estimating frequencies using the Pisarenko harmonic decomposition is
provided in phd.m.

FIGURE 9.21
Pseudospectrum for the Pisarenko harmonic decomposition of a
sinusoid in noise with frequency $f = 0.2$.


9.6.3 MUSIC Algorithm 


The multiple signal classification (MUSIC) frequency estimation method was proposed 
as an improvement on the Pisarenko harmonic decomposition (Bienvenu and Kopp 1983; 
Schmidt 1986). Like the Pisarenko harmonic decomposition, the M-dimensional space is 
split into signal and noise components using the eigenvectors of the correlation matrix from 
(9.6.15). However, rather than limit the length of the time window to $M = P + 1$, that
is, 1 greater than the number of complex exponentials, we allow the size of the time window
to be $M > P + 1$. Therefore, the noise subspace has a dimension greater than 1. Using
this larger dimension allows for averaging over the noise subspace, providing an improved, 
more robust frequency estimation method than Pisarenko harmonic decomposition. 

Because of the orthogonality between the noise and signal subspaces, all the time- 
window frequency vectors of the complex exponentials are orthogonal to the noise subspace 
from (9.6.19). Thus, for each noise eigenvector $\mathbf{q}_m$ ($P < m \le M$)

$$\mathbf{v}^H(f_p)\mathbf{q}_m = \sum_{k=1}^{M} q_m(k)\, e^{-j2\pi f_p (k-1)} = 0 \tag{9.6.28}$$

for all the P frequencies $f_p$ of the complex exponentials. Therefore, if we compute a
pseudospectrum for each noise eigenvector as

$$\bar{R}_m(e^{j2\pi f}) = \frac{1}{\left|\mathbf{v}^H(f)\mathbf{q}_m\right|^2} = \frac{1}{\left|Q_m(e^{j2\pi f})\right|^2} \tag{9.6.29}$$

the polynomial $Q_m(e^{j2\pi f})$ has $M - 1$ roots, P of which correspond to the frequencies of the
complex exponentials. These roots produce P peaks in the pseudospectrum from (9.6.29). 
Note that the pseudospectra of all M — P noise eigenvectors share these roots that are due 
to the signal subspace. The remaining roots of the noise eigenvectors, however, occur at 
different frequencies. There are no constraints on the location of these roots, so that some 
may be close to the unit circle and produce extra peaks in the pseudospectrum. A means of 
reducing the levels of these spurious peaks in the pseudospectrum is to average the M — P 
pseudospectra of the individual noise eigenvectors 


$$\bar{R}_{music}(e^{j2\pi f}) = \frac{1}{\displaystyle\sum_{m=P+1}^{M}\left|\mathbf{v}^H(f)\mathbf{q}_m\right|^2} = \frac{1}{\displaystyle\sum_{m=P+1}^{M}\left|Q_m(e^{j2\pi f})\right|^2} \tag{9.6.30}$$

which is known as the MUSIC pseudospectrum. The frequency estimates of the P complex
exponentials are then taken as the P peaks in this pseudospectrum. Again, the term
pseudospectrum is used because the quantity in (9.6.30) does not contain information about the
powers of the complex exponentials or the background noise level. Note that for $M = P + 1$,
the MUSIC method is equivalent to Pisarenko harmonic decomposition.

The implicit assumption in the MUSIC pseudospectrum is that the noise eigenvalues
all have equal power $\lambda_m = \sigma_w^2$, that is, the noise is white. However, in practice, when an
estimate is used in place of the actual correlation matrix, the noise eigenvalues will not be
equal. The differences become more pronounced when the correlation matrix is estimated
from a small number of data samples. Thus, a slight variation on the MUSIC algorithm,
known as the eigenvector (ev) method, was proposed to account for the potentially different
noise eigenvalues (Johnson and DeGraaf 1982). For this method, the pseudospectrum is

$$\bar{R}_{ev}(e^{j2\pi f}) = \frac{1}{\displaystyle\sum_{m=P+1}^{M}\frac{1}{\lambda_m}\left|\mathbf{v}^H(f)\mathbf{q}_m\right|^2} = \frac{1}{\displaystyle\sum_{m=P+1}^{M}\frac{1}{\lambda_m}\left|Q_m(e^{j2\pi f})\right|^2} \tag{9.6.31}$$

where $\lambda_m$ is the eigenvalue corresponding to the eigenvector $\mathbf{q}_m$. The pseudospectrum of
each eigenvector is normalized by its corresponding eigenvalue. In the case of equal noise
eigenvalues ($\lambda_m = \sigma_w^2$) for $P + 1 \le m \le M$, the eigenvector and MUSIC methods are
identical.
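
The MUSIC and eigenvector pseudospectra of (9.6.30) and (9.6.31) differ only in the weighting of the noise eigenvectors, as the following MATLAB sketch shows. It is an illustration under assumptions of our own: the true correlation matrix is used so that the outcome is deterministic (in practice the sample correlation matrix takes its place), the parameter values are hypothetical, and the peaks are picked with a simple local-maximum search rather than a dedicated peak-finding routine.

```matlab
% MUSIC and eigenvector (ev) pseudospectra from a correlation matrix (sketch).
M = 8; P = 2; sigw2 = 1; Nfft = 1024;        % hypothetical sizes
fp = [0.1 0.2];
V  = exp(1j*2*pi*(0:M-1).'*fp);
Rx = V*V' + sigw2*eye(M);                    % true Rx; in practice use the sample estimate
Rx = (Rx + Rx')/2;                           % enforce exact Hermitian symmetry

[Q, D] = eig(Rx);
[lambda, idx] = sort(real(diag(D)), 'descend');
Qw = Q(:, idx(P+1:M));                       % noise eigenvectors q_{P+1}, ..., q_M
lw = lambda(P+1:M);                          % corresponding noise eigenvalues

Qf     = abs(fftshift(fft(Qw, Nfft), 1)).^2; % |Q_m(e^{j2*pi*f})|^2, one column per m
fgrid  = (0:Nfft-1).'/Nfft - 0.5;            % frequency grid on [-0.5, 0.5)
Rmusic = 1./sum(Qf, 2);                      % (9.6.30)
Rev    = 1./(Qf*(1./lw));                    % (9.6.31): each term weighted by 1/lambda_m

% Frequency estimates: the P largest local maxima of the MUSIC pseudospectrum.
c   = Rmusic(2:end-1);
ipk = find(c > Rmusic(1:end-2) & c >= Rmusic(3:end)) + 1;
[~, order] = sort(Rmusic(ipk), 'descend');
fhat = sort(fgrid(ipk(order(1:P))));
```

Because the noise eigenvalues of the true correlation matrix are all equal to the noise power (here 1), `Rev` and `Rmusic` coincide in this example; with an estimated matrix they would differ.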

The peaks in the MUSIC pseudospectrum correspond to the frequencies at which the
denominator in (9.6.30), $\sum_{m=P+1}^{M} |Q_m(e^{j2\pi f})|^2$, approaches zero. Therefore, we might
want to consider the z-transform of this denominator

$$P_{music}(z) = \sum_{m=P+1}^{M} Q_m(z)\, Q_m^{*}\!\left(\frac{1}{z^{*}}\right) \tag{9.6.32}$$

which is the sum of the z-transforms of the pseudospectrum due to each noise eigenvector.
This polynomial has $2M - 1$ coefficients (order $2M - 2$) and therefore $M - 1$ pairs of roots,
with one root of each pair inside and one outside the unit circle. Since we assume that the
complex exponentials are not damped, their corresponding roots must lie on the unit circle.
Thus, if we have found the $M - 1$ roots of (9.6.32), the P closest roots to the unit circle will
correspond to the complex exponentials. The phases of these roots are then the frequency
estimates. This method of rooting the polynomial corresponding to the MUSIC pseudospectrum
is known as root-MUSIC (Barabell 1983). Note that in many cases, a rooting method is more
efficient than computing a pseudospectrum at a very fine frequency resolution that may require
a very large FFT. Statistical performance analyses of the MUSIC algorithm can be found in Kaveh
and Barabell (1986) and Stoica and Nehorai (1989). For the performance of the root-MUSIC
method see Rao and Hari (1989). A routine for the MUSIC algorithm is provided in music.m
and a routine for the root-MUSIC algorithm is provided in rootmusic.m.
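
The rooting step can be sketched in MATLAB as follows. This is not the rootmusic.m routine referred to above but a minimal illustration under assumptions of our own: the true correlation matrix is used, the parameter values are hypothetical, and the polynomial coefficients are obtained by summing the diagonals of the noise-subspace projection matrix. Since each signal contributes a pair of roots at (or numerically very near) the same point on the unit circle, the sketch keeps the 2P roots closest to the unit circle and averages the phases within each pair; this pairing is a convenience that presumes well-separated frequencies away from ±0.5.

```matlab
% Root-MUSIC from the noise subspace of a correlation matrix (sketch).
M = 8; P = 2; sigw2 = 1;                     % hypothetical sizes
fp = [0.1 0.2];
V  = exp(1j*2*pi*(0:M-1).'*fp);
Rx = V*V' + sigw2*eye(M);                    % true Rx; in practice use the sample estimate
Rx = (Rx + Rx')/2;                           % enforce exact Hermitian symmetry

[Q, D] = eig(Rx);
[~, idx] = sort(real(diag(D)), 'descend');
Qw = Q(:, idx(P+1:M));                       % noise eigenvectors
C  = Qw*Qw';                                 % noise-subspace projection matrix

% On the unit circle, sum_m |Q_m|^2 = v^H(f) C v(f) = sum_l c_l z^l, where
% c_l is the sum of the l-th diagonal of C. Multiplying by z^(M-1) gives a
% polynomial with 2M-1 coefficients (order 2M-2).
coef = zeros(2*M-1, 1);
for l = -(M-1):(M-1)
    coef(M-l) = sum(diag(C, l));             % stored with the highest power of z first
end
z = roots(coef);

% Roots occur in pairs z, 1/conj(z); the signal pairs sit on the unit circle.
[~, order] = sort(abs(abs(z) - 1));
zsig = z(order(1:2*P));                      % the 2P roots closest to the unit circle
ang  = sort(angle(zsig))/(2*pi);
fhat = mean(reshape(ang, 2, P), 1).';        % average each pair: one estimate per signal
```

No frequency grid is involved, so the accuracy is not limited by an FFT size.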


EXAMPLE 9.6.2. In this example, we demonstrate the use of the MUSIC algorithm and examine 
its performance in terms of resolution with respect to that of the minimum-variance spectral 
estimator. Consider the following scenario: Two complex exponentials in unit power noise ($\sigma_w^2 = 1$)
with normalized frequencies $f = 0.1, 0.2$, both with amplitudes of $\alpha = 1$. We generate
$N = 128$ samples of the signal and use a frequency vector of length $M = 8$. Proceeding as we
did in Example 9.6.1, we compute the eigendecomposition and partition it into signal and noise 
subspaces. The MUSIC pseudospectrum is computed as 


Qbar = zeros(Nfft,1);
for n = 1:(M-P)
  Qbar = Qbar + abs(fftshift(fft(Q(:,M-(n-1)),Nfft))).^2;  % add |Q_m|^2 of each noise eigenvector
end
Rbar = 1./Qbar;


The minimum-variance spectral estimate and the MUSIC pseudospectrum are computed and 
averaged over 1000 realizations using an FFT size of 1024. The result is shown in Figure 9.22. 
The two exponentials have been clearly resolved using the MUSIC algorithm, whereas they 
are not very clear using the minimum-variance spectral estimate. Since the minimum-variance 
spectral estimator is nonparametric and makes no assumptions about the underlying model, it 
cannot achieve the resolution of the MUSIC algorithm. 


9.6.4 Minimum-Norm Method 


The minimum-norm method (Kumaresan and Tufts 1983), like the MUSIC algorithm, uses
a time-window vector of length $M > P + 1$ for the purposes of frequency estimation. For
MUSIC, a larger time window is used than for Pisarenko harmonic decomposition, resulting
in a larger noise subspace. The use of a larger subspace provides the necessary robustness
for frequency estimation when an estimated correlation matrix is used. The same principle is
applied in the minimum-norm frequency estimation method. However, rather than average
the pseudospectra of all the noise subspace eigenvectors to reduce spurious peaks, as in the
case of the MUSIC algorithm, a different approach is taken.

FIGURE 9.22
Comparison of the minimum-variance spectral estimate (dashed line)
and the MUSIC pseudospectrum (solid line) for two complex
exponentials in noise.

Consider a single vector $\mathbf{u}$ contained in the noise subspace. The pseudospectrum of this
vector is given by

$$\bar{R}(e^{j2\pi f}) = \frac{1}{\left|\mathbf{v}^H(f)\mathbf{u}\right|^2} \tag{9.6.33}$$

Since the vector $\mathbf{u}$ lies in the noise subspace, its pseudospectrum in (9.6.33) has P peaks
corresponding to the complex exponentials in the signal subspace. However, $\mathbf{u}$ is of length M so
that its pseudospectrum may exhibit an additional $M - P - 1$ peaks that do not correspond
to the frequencies of the complex exponentials. These spurious peaks lead to frequency
estimation errors. In the case of Pisarenko harmonic decomposition, spurious peaks were
not a concern since $M = P + 1$ and therefore its pseudospectrum in (9.6.25) only had
P peaks. On the other hand, the MUSIC algorithm diluted the strength of these spurious
peaks since its pseudospectrum in (9.6.30) is produced by averaging the pseudospectra of
the $M - P$ noise eigenvectors.

Recall the projection onto the noise subspace from (9.6.17) is

$$\mathbf{P}_w = \mathbf{Q}_w\mathbf{Q}_w^H \tag{9.6.34}$$

where $\mathbf{Q}_w$ is the matrix of noise eigenvectors. Therefore, for any vector $\mathbf{u}$ that lies in the
noise subspace

$$\mathbf{P}_w\mathbf{u} = \mathbf{u} \qquad \mathbf{P}_s\mathbf{u} = \mathbf{0} \tag{9.6.35}$$

where $\mathbf{P}_s$ is the signal subspace projection matrix and $\mathbf{0}$ is the length-M zero vector. Now
let us consider the z-transform of the coefficients of $\mathbf{u} = [u(1) \;\; u(2) \;\; \cdots \;\; u(M)]^T$


M-1 


P M-1 
U@) = dl ukt De" =[Ja-e 27) T] a-ac) (9.6.36) 
k=1 


k=0 k=P+1 


This polynomial is the product of the P roots corresponding to complex exponentials that 
lie on the unit circle and the M — P — 1 roots that in general do not lie directly on the unit 
circle but can potentially produce spurious peaks in the pseudospectrum of u. Therefore, 
we want to choose u so that it minimizes the spurious peaks due to these other roots of its 
associated polynomial U(z). 
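
The root structure of U(z) can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original text: it uses an exact (not estimated) correlation matrix for unit-power exponentials in unit-power white noise, arbitrary frequencies, and a noise-subspace vector obtained by projecting the first unit vector, which is chosen only so that the leading coefficient of the polynomial is safely nonzero.

```matlab
% Sketch: roots of U(z) for a vector u in the noise subspace of an exact
% correlation matrix. Frequencies and sizes are illustrative assumptions.
M = 8; P = 2;
f = [0.10 0.25];                          % assumed normalized frequencies
k = (0:M-1).';
V = exp(1j*2*pi*k*f);                     % M x P time-window frequency vectors
R = V*V' + eye(M);                        % exact correlation matrix (assumed model)
R = (R + R')/2;                           % enforce exact Hermitian symmetry

[Q, D] = eig(R);
[~, idx] = sort(real(diag(D)), 'descend');
Qn = Q(:, idx(P+1:M));                    % noise eigenvectors
Pn = Qn*Qn';                              % noise subspace projection matrix

u = Pn(:,1);                              % projection of [1 0 ... 0].' onto the noise subspace
z = roots(u);                             % the M-1 roots of U(z)

% Distance from each exponential exp(j*2*pi*f_p) to its nearest root
zs = exp(1j*2*pi*f(:));
d  = zeros(P,1);
for p = 1:P
    d(p) = min(abs(z - zs(p)));
end
disp(d.');                                % P roots sit on the unit circle at the true frequencies
disp(abs(z).');                           % magnitudes of all M-1 roots
```

With an estimated correlation matrix the P signal roots move slightly off the unit circle, which is why the choice of u matters in practice.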

The minimum-norm method, as its name implies, seeks to minimize the norm of u in 
order to avoid spurious peaks in its pseudospectrum. Using (9.6.35), the norm of a vector 
u contained in the noise subspace is 

$$\|\mathbf{u}\|^2 = \mathbf{u}^H\mathbf{u} = \mathbf{u}^H\mathbf{P}_n\mathbf{u} \tag{9.6.37}$$

However, an unconstrained minimization of this norm will produce the zero vector. Therefore, 
we place the constraint that the first element of u must equal 1.† This constraint can 
be expressed as 

$$\boldsymbol{\delta}_1^H\mathbf{u} = 1 \tag{9.6.38}$$

where $\boldsymbol{\delta}_1 = [1\; 0\; \cdots\; 0]^T$. Then the determination of the minimum-norm vector comes 
down to solving the following constrained minimization problem: 

$$\min\ \|\mathbf{u}\|^2 = \mathbf{u}^H\mathbf{P}_n\mathbf{u} \qquad \text{subject to} \qquad \boldsymbol{\delta}_1^H\mathbf{u} = 1 \tag{9.6.39}$$


The solution can be found by using Lagrange multipliers (see Appendix B) and is given by 

$$\mathbf{u}_{mn} = \frac{\mathbf{P}_n\boldsymbol{\delta}_1}{\boldsymbol{\delta}_1^H\mathbf{P}_n\boldsymbol{\delta}_1} \tag{9.6.40}$$

The frequency estimates are then obtained from the peaks in the pseudospectrum of the 
minimum-norm (mn) vector $\mathbf{u}_{mn}$ 

$$\bar{R}_{mn}(e^{j2\pi f}) = \frac{1}{|\mathbf{v}^H(f)\,\mathbf{u}_{mn}|^2} \tag{9.6.41}$$

The performance of the minimum-norm frequency estimation method is similar to that 
of MUSIC. For a performance comparison see Kaveh and Barabell (1986). Note that it is 
also possible to implement the minimum-norm method by rooting a polynomial rather than 
computing a pseudospectrum (see Problem 9.25). 
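
As a cross-check on the closed form in (9.6.40), the same result can be reached without Lagrange multipliers. The sketch below is an alternative argument, not the derivation of Appendix B: it parameterizes the noise-subspace vector as $\mathbf{u} = \mathbf{Q}_n\mathbf{c}$, where the coefficient vector $\mathbf{c}$ and the auxiliary vector $\mathbf{a}$ are symbols introduced here for illustration, and it assumes that $\boldsymbol{\delta}_1$ does not lie entirely in the signal subspace (so that $\mathbf{a} \neq \mathbf{0}$).

```latex
\begin{aligned}
&\mathbf{u} = \mathbf{Q}_n\mathbf{c},\quad \mathbf{Q}_n^H\mathbf{Q}_n = \mathbf{I}
 \;\Longrightarrow\; \|\mathbf{u}\|^2 = \mathbf{c}^H\mathbf{c},\qquad
 \boldsymbol{\delta}_1^H\mathbf{u} = \mathbf{a}^H\mathbf{c}
 \quad\text{with}\quad \mathbf{a} = \mathbf{Q}_n^H\boldsymbol{\delta}_1 \\[4pt]
&1 = |\mathbf{a}^H\mathbf{c}|^2 \le (\mathbf{a}^H\mathbf{a})(\mathbf{c}^H\mathbf{c})
 \;\Longrightarrow\; \|\mathbf{u}\|^2 \ge \frac{1}{\mathbf{a}^H\mathbf{a}}
 = \frac{1}{\boldsymbol{\delta}_1^H\mathbf{P}_n\boldsymbol{\delta}_1} \\[4pt]
&\text{equality holds iff}\quad \mathbf{c} = \frac{\mathbf{a}}{\mathbf{a}^H\mathbf{a}}
 \;\Longrightarrow\;
 \mathbf{u}_{mn} = \frac{\mathbf{Q}_n\mathbf{Q}_n^H\boldsymbol{\delta}_1}
 {\boldsymbol{\delta}_1^H\mathbf{Q}_n\mathbf{Q}_n^H\boldsymbol{\delta}_1}
 = \frac{\mathbf{P}_n\boldsymbol{\delta}_1}{\boldsymbol{\delta}_1^H\mathbf{P}_n\boldsymbol{\delta}_1}
\end{aligned}
```

The middle line is the Cauchy-Schwarz inequality applied to the constraint, and the resulting minimum squared norm is $1/(\boldsymbol{\delta}_1^H\mathbf{P}_n\boldsymbol{\delta}_1)$.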


EXAMPLE 9.6.3. In this example, we illustrate the use of the minimum-norm method and compare 
its performance to that of the other three frequency estimation methods discussed in this chapter: 
Pisarenko harmonic decomposition, the MUSIC algorithm, and the eigenvector method. The 
pseudospectrum of the minimum-norm method is found by first computing the minimum-norm 
vector $\mathbf{u}_{mn}$ and then finding its pseudospectrum, that is, 


delta1 = zeros(M,1); delta1(1) = 1;         % first unit vector 
Pn = Q(:,(P+1):M)*Q(:,(P+1):M)';            % noise subspace projection matrix 
u = (Pn*delta1)/(delta1'*Pn*delta1);        % minimum-norm vector 
Rbar = 1./abs(fftshift(fft(u,Nfft))).^2;    % pseudospectrum 


Consider the case of P = 4 complex exponentials in noise with frequencies f = 0.1, 0.25, 
0.4, and -0.1, all with an amplitude of α = 1. The noise power is set to $\sigma_w^2 = 1$, and 
100 realizations are generated. The time-window length used was M = 8 for all the methods except 
Pisarenko harmonic decomposition, which is constrained to use M = P + 1 = 5. The pseudospectra 
are shown in Figure 9.23 with an FFT size of 1024, where we have not averaged in order 
to demonstrate the variance of the various methods. Here we see the large variance in the 
frequency estimates that is produced by Pisarenko harmonic decomposition compared to the other 
methods, which is a direct result of using a one-dimensional noise subspace. The other methods all 
perform comparably in terms of estimating the frequencies of the complex exponentials. Note the 
fluctuations in the pseudospectrum of the eigenvector method that result from the normalization 
by the eigenvalues. Since these eigenvalues vary over realizations, the pseudospectra will also 
reflect a similar variation. Routines for the eigenvector method and the minimum-norm method 
are provided in ev_method.m and minnorm.m, respectively. 


† The choice of a value of 1 is somewhat arbitrary, since any nonzero constant will result in a similar solution. 


FIGURE 9.23 
Comparison of the eigendecomposition-based frequency estimation methods: (a) Pisarenko 
harmonic decomposition, (b) MUSIC, (c) eigenvector method, and (d) minimum-norm method. 


9.6.5 ESPRIT Algorithm 


A frequency estimation technique that is built upon the same principles as other subspace 
methods but further exploits a deterministic relationship between subspaces is the estimation 
of signal parameters via rotational invariance techniques (ESPRIT) algorithm. This method 
differs from the other subspace methods discussed so far in this chapter in that the signal 
subspace is estimated from the data matrix X rather than the estimated correlation matrix 
$\hat{\mathbf{R}}_x$. The essence of ESPRIT lies in the rotational property between staggered subspaces 
that is invoked to produce the frequency estimates. In the case of a discrete-time signal or 
time series, this property relies on observations of the signal over two identical intervals 
staggered in time. This condition arises naturally for discrete-time signals, provided that the 
sampling is performed uniformly in time.† Extensions of the ESPRIT method to a spatial 
array of sensors, the application for which it was originally proposed, will be discussed 
in Section 11.7. We first describe the original, least-squares version of the 
algorithm (Roy et al. 1986) and then extend the derivation to total least-squares ESPRIT 
(Roy and Kailath 1989), which is the preferred method for use. Since the derivation of the 
algorithm requires an extensive amount of formulation and matrix manipulations, we have 
included a block diagram in Figure 9.24 to be used as a guide through this process. 


† This condition is violated in the case of a nonuniformly sampled time series. 


FIGURE 9.24 
Block diagram demonstrating the flow of the ESPRIT algorithm starting from the data matrix 
through the frequency estimates. 
[Block labels: signal model; time-window signal vector model; data matrix; separate signal and 
noise subspaces; partition into staggered subspaces; compute Ψ (LS or TLS) from U_2 = U_1 Ψ; 
eigenvalues of Ψ; frequency estimates for 1 ≤ p ≤ P.] 


Consider a single complex exponential $s_0(n) = \alpha e^{j2\pi f n}$ with complex amplitude α and 
frequency f. This signal has the following property 

$$s_0(n+1) = \alpha e^{j2\pi f(n+1)} = s_0(n)\,e^{j2\pi f} \tag{9.6.42}$$

that is, the next sample value is a phase-shifted version of the current value. This phase shift 
can be represented as a rotation on the unit circle $e^{j2\pi f}$. Recall the time-window vector 
model from (9.6.4) consisting of a signal s(n), made up of complex exponentials, and the 
noise component w(n) 

$$\mathbf{x}(n) = \sum_{p=1}^{P}\alpha_p\mathbf{v}(f_p)\,e^{j2\pi n f_p} + \mathbf{w}(n) = \mathbf{V}\boldsymbol{\Phi}^n\boldsymbol{\alpha} + \mathbf{w}(n) = \mathbf{s}(n) + \mathbf{w}(n) \tag{9.6.43}$$


where the P columns of matrix V are length-M time-window frequency vectors of the 
complex exponentials 

$$\mathbf{V} = [\mathbf{v}(f_1)\;\; \mathbf{v}(f_2)\;\; \cdots\;\; \mathbf{v}(f_P)] \tag{9.6.44}$$

The vector $\boldsymbol{\alpha}$ consists of the amplitudes of the complex exponentials $\alpha_p$. On the other hand, 
matrix $\boldsymbol{\Phi}$ is the diagonal matrix of phase shifts between neighboring time samples of the 
individual, complex exponential components of s(n) 

$$\boldsymbol{\Phi} = \mathrm{diag}\{\phi_1, \phi_2, \ldots, \phi_P\} = \begin{bmatrix} e^{j2\pi f_1} & 0 & \cdots & 0 \\ 0 & e^{j2\pi f_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{j2\pi f_P} \end{bmatrix} \tag{9.6.45}$$

where $\phi_p = e^{j2\pi f_p}$ for p = 1, 2, ..., P. Since the frequencies of the complex exponentials 
$f_p$ completely describe this rotation matrix, frequency estimates can be obtained by finding 
$\boldsymbol{\Phi}$. Let us consider two overlapping subwindows of length M - 1 within the length-M 
time-window vector. This subwindowing operation is illustrated in Figure 9.25. Consider 
the signal consisting of the sum of complex exponentials 

$$\mathbf{s}(n) = \begin{bmatrix} \mathbf{s}_{M-1}(n) \\ s(n+M-1) \end{bmatrix} = \begin{bmatrix} s(n) \\ \mathbf{s}_{M-1}(n+1) \end{bmatrix} \tag{9.6.46}$$

where $\mathbf{s}_{M-1}(n)$ is the length-(M - 1) subwindow of $\mathbf{s}(n)$, that is, 

$$\mathbf{s}_{M-1}(n) = \mathbf{V}_{M-1}\boldsymbol{\Phi}^n\boldsymbol{\alpha} \tag{9.6.47}$$


FIGURE 9.25 
Time-staggered, overlapping windows used by the ESPRIT algorithm. 


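Matrix $\mathbf{V}_{M-1}$ is constructed in the same manner as V except its time-window frequency 
vectors are of length M - 1, denoted as $\mathbf{v}_{M-1}(f)$, 

$$\mathbf{V}_{M-1} = [\mathbf{v}_{M-1}(f_1)\;\; \mathbf{v}_{M-1}(f_2)\;\; \cdots\;\; \mathbf{v}_{M-1}(f_P)] \tag{9.6.48}$$

Recall that s(n) is the scalar signal made up of the sum of complex exponentials at time n. 
Using the relation in (9.6.47), we can define the matrices 

$$\mathbf{V}_1 = \mathbf{V}_{M-1}\boldsymbol{\Phi}^n \qquad \text{and} \qquad \mathbf{V}_2 = \mathbf{V}_{M-1}\boldsymbol{\Phi}^{n+1} \tag{9.6.49}$$

where $\mathbf{V}_1$ and $\mathbf{V}_2$ correspond to the unstaggered and staggered windows, that is, 

$$\mathbf{V} = \begin{bmatrix} \mathbf{V}_1 \\ *\;\; *\;\; \cdots\;\; * \end{bmatrix} = \begin{bmatrix} *\;\; *\;\; \cdots\;\; * \\ \mathbf{V}_2 \end{bmatrix} \tag{9.6.50}$$

Clearly, by examining (9.6.49), these two matrices of time-window frequency vectors are 
related as 

$$\mathbf{V}_2 = \mathbf{V}_1\boldsymbol{\Phi} \tag{9.6.51}$$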


Note that each of these two matrices spans a different, though related, (M — 1)-dimensional 
subspace. 
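
The relation between the unstaggered and staggered matrices of time-window frequency vectors is easy to confirm numerically. The following sketch uses the row partition of (9.6.50); the frequencies and the time-window length are illustrative assumptions.

```matlab
% Sketch: numerical check that the staggered rows of V equal the
% unstaggered rows times Phi. Values are illustrative assumptions.
M = 8;
f = [0.10 0.15 0.40 -0.15];               % assumed normalized frequencies
k = (0:M-1).';
V   = exp(1j*2*pi*k*f);                   % M x P time-window frequency vectors
Phi = diag(exp(1j*2*pi*f));               % P x P diagonal matrix of phase shifts

V1 = V(1:M-1, :);                         % unstaggered part: first M-1 rows
V2 = V(2:M,   :);                         % staggered part: last M-1 rows

err = norm(V2 - V1*Phi, 'fro');           % zero up to rounding error
fprintf('||V2 - V1*Phi||_F = %g\n', err);
```

Each column of V is a geometric sequence in its own phase factor, so shifting down by one row multiplies column p by $\phi_p$, which is exactly right-multiplication by the diagonal matrix.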

Now suppose that we have a data matrix X from (9.6.20) with N data records of the 
length-M time-window vector signal x(n). Using the singular value decomposition (SVD) 
discussed in Chapter 8, we can write the data matrix as† 

$$\mathbf{X} = \mathbf{L}\boldsymbol{\Sigma}\mathbf{U}^H \tag{9.6.52}$$

where L is an N x N matrix of left singular vectors and U is an M x M matrix of right 
singular vectors. Both of these matrices are unitary; that is, $\mathbf{L}^H\mathbf{L} = \mathbf{I}$ and $\mathbf{U}^H\mathbf{U} = \mathbf{I}$. The 
matrix $\boldsymbol{\Sigma}$ has dimensions N x M and consists of singular values on the main diagonal ordered 
in descending magnitude. The squared magnitudes of the singular values are equal to the 
eigenvalues of $\hat{\mathbf{R}}_x$ scaled by a factor of N from (9.6.21), and the columns of U are their 
corresponding eigenvectors. Thus, U forms an orthonormal basis for the underlying 
M-dimensional vector space. This space can be partitioned into signal and noise subspaces 
as 

$$\mathbf{U} = [\mathbf{U}_s \,|\, \mathbf{U}_n] \tag{9.6.53}$$


where $\mathbf{U}_s$ is the matrix of right-hand singular vectors corresponding to the singular values 
with the P largest magnitudes. Note that since the signal portion consists of the sum of 
complex exponentials modeled as time-window frequency vectors v(f), all these frequency 
vectors, for $f = f_1, f_2, \ldots, f_P$, must also lie in the signal subspace. As a result, the matrices 
V and $\mathbf{U}_s$ span the same subspace. Therefore, there exists an invertible transformation T 
that maps $\mathbf{U}_s$ into V, that is, 

$$\mathbf{V} = \mathbf{U}_s\mathbf{T} \tag{9.6.54}$$


The transformation T is never solved for in this derivation, but instead is only formulated 
as a mapping between these two matrices within the signal subspace. 

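Proceeding as we did with the matrix V in (9.6.50), we can partition the signal subspace 
into two smaller (M - 1)-dimensional subspaces as 

$$\mathbf{U}_s = \begin{bmatrix} \mathbf{U}_1 \\ *\;\; *\;\; \cdots\;\; * \end{bmatrix} = \begin{bmatrix} *\;\; *\;\; \cdots\;\; * \\ \mathbf{U}_2 \end{bmatrix} \tag{9.6.55}$$

where $\mathbf{U}_1$ and $\mathbf{U}_2$ correspond to the unstaggered and staggered subspaces, respectively. 
Since $\mathbf{V}_1$ and $\mathbf{V}_2$ correspond to the same subspaces, the relation from (9.6.54) must also 
hold for these subspaces 

$$\mathbf{V}_1 = \mathbf{U}_1\mathbf{T} \qquad \mathbf{V}_2 = \mathbf{U}_2\mathbf{T} \tag{9.6.56}$$

The staggered and unstaggered components of the matrix V in (9.6.50) are related through 
the subspace rotation $\boldsymbol{\Phi}$ in (9.6.51). Since the matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ also span these respective, 
related subspaces, a similar, though different, rotation must exist that relates (rotates) $\mathbf{U}_1$ to 
$\mathbf{U}_2$ 

$$\mathbf{U}_2 = \mathbf{U}_1\boldsymbol{\Psi} \tag{9.6.57}$$

where $\boldsymbol{\Psi}$ is this rotation matrix. 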
† Our notation differs slightly from that introduced in Chapter 8 in order to avoid confusion with the matrix of 
time-window frequency vectors V. 

Recall that frequency estimation comes down to solving for the subspace rotation 
matrix $\boldsymbol{\Phi}$. We can estimate $\boldsymbol{\Phi}$ by making use of the relations in (9.6.56) together with the 
rotations between the staggered signal subspaces in (9.6.51) and (9.6.57). In this process, 
the matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ are known from the SVD on data matrix X. First, we solve for $\boldsymbol{\Psi}$ 
from the relation in (9.6.57), using the method of least-squares (LS) from Chapter 8 

$$\boldsymbol{\Psi} = (\mathbf{U}_1^H\mathbf{U}_1)^{-1}\mathbf{U}_1^H\mathbf{U}_2 \tag{9.6.58}$$

Substituting (9.6.57) into (9.6.56), we have 

$$\mathbf{V}_2 = \mathbf{U}_2\mathbf{T} = \mathbf{U}_1\boldsymbol{\Psi}\mathbf{T} \tag{9.6.59}$$

Similarly, we can also solve for $\mathbf{V}_2$, using the relation in (9.6.51) and substituting (9.6.56) 
for $\mathbf{V}_1$ 

$$\mathbf{V}_2 = \mathbf{V}_1\boldsymbol{\Phi} = \mathbf{U}_1\mathbf{T}\boldsymbol{\Phi} \tag{9.6.60}$$

Thus, equating the two right-hand sides of (9.6.59) and (9.6.60), we have the following 
relation between the two subspace rotations 

$$\boldsymbol{\Psi}\mathbf{T} = \mathbf{T}\boldsymbol{\Phi} \tag{9.6.61}$$

or equivalently 

$$\boldsymbol{\Psi} = \mathbf{T}\boldsymbol{\Phi}\mathbf{T}^{-1} \tag{9.6.62}$$


Equations (9.6.61) and (9.6.62) should be recognized as the relationship between eigenvectors 
and eigenvalues of the matrix $\boldsymbol{\Psi}$ (Golub and Van Loan 1996). Therefore, the diagonal 
elements of $\boldsymbol{\Phi}$, $\phi_p$ for p = 1, 2, ..., P, are simply the eigenvalues of $\boldsymbol{\Psi}$. As a result, the 
estimates of the frequencies are 

$$\hat{f}_p = \frac{\angle\phi_p}{2\pi} \tag{9.6.63}$$

where $\angle\phi_p$ is the phase of $\phi_p$. Although the principle behind the ESPRIT algorithm, namely, 
the use of subspace rotations, is quite simple, one can easily get lost in the details of the 
derivation of the algorithm. Note that we have only used simple matrix relationships. An 
illustrative example of the implementation of ESPRIT in MATLAB is given in Example 
9.6.4 to help clarify the details of the algorithm. However, first we give a total least-squares 
version of the algorithm, which is the preferred method for use. 
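
To make the chain from the staggered subspaces to the frequency estimate concrete, consider the simplest case of a single exponential (P = 1) without noise. The worked case below is an illustrative sketch, not part of the original derivation: it assumes the signal subspace is spanned exactly by $\mathbf{v}(f)$, and the arbitrary phase $\theta$ of the unit-norm singular vector is a symbol introduced here.

```latex
\begin{aligned}
\mathbf{U}_s &= \frac{e^{j\theta}}{\sqrt{M}}\,\mathbf{v}(f), \qquad
\mathbf{v}(f) = [\,1\;\; e^{j2\pi f}\;\; \cdots\;\; e^{j2\pi f(M-1)}\,]^T \\[4pt]
\mathbf{U}_1 &= \frac{e^{j\theta}}{\sqrt{M}}\,\mathbf{v}_{M-1}(f), \qquad
\mathbf{U}_2 = \frac{e^{j\theta}}{\sqrt{M}}\,e^{j2\pi f}\,\mathbf{v}_{M-1}(f) = \mathbf{U}_1\,e^{j2\pi f} \\[4pt]
\boldsymbol{\Psi} &= (\mathbf{U}_1^H\mathbf{U}_1)^{-1}\mathbf{U}_1^H\mathbf{U}_2 = e^{j2\pi f}
\quad\Longrightarrow\quad
\hat{f} = \frac{\angle\boldsymbol{\Psi}}{2\pi} = f
\end{aligned}
```

Here $\boldsymbol{\Psi}$, $\boldsymbol{\Phi}$, and $\mathbf{T}$ are all scalars, so the similarity transformation in (9.6.62) is trivial and the rotation is read off directly.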

Note that the subspaces $\mathbf{U}_1$ and $\mathbf{U}_2$ are both only estimates of the true subspaces that 
correspond to $\mathbf{V}_1$ and $\mathbf{V}_2$, respectively, obtained from the data matrix X. The estimate of 
the subspace rotation was obtained by solving (9.6.57) using the LS criterion 

$$\boldsymbol{\Psi}_{ls} = (\mathbf{U}_1^H\mathbf{U}_1)^{-1}\mathbf{U}_1^H\mathbf{U}_2 \tag{9.6.64}$$

This LS solution is obtained by minimizing the errors in an LS sense from the following 
formulation 

$$\mathbf{U}_2 + \mathbf{E}_2 = \mathbf{U}_1\boldsymbol{\Psi} \tag{9.6.65}$$

where $\mathbf{E}_2$ is a matrix consisting of errors between $\mathbf{U}_2$ and the true subspace corresponding 
to $\mathbf{V}_2$. Note that this LS formulation assumes errors only on the estimation of $\mathbf{U}_2$ and no 
errors between $\mathbf{U}_1$ and the true subspace that it is attempting to estimate corresponding to 
$\mathbf{V}_1$. Therefore, since $\mathbf{U}_1$ is also an estimated subspace, a more appropriate formulation is 

$$\mathbf{U}_2 + \mathbf{E}_2 = (\mathbf{U}_1 + \mathbf{E}_1)\boldsymbol{\Psi} \tag{9.6.66}$$

where $\mathbf{E}_1$ is the matrix representing the errors between $\mathbf{U}_1$ and the true subspace 
corresponding to $\mathbf{V}_1$. A solution to this problem, known as total least squares (TLS), is obtained 
by minimizing the Frobenius norm of the two error matrices 

$$\left\|[\mathbf{E}_1\;\; \mathbf{E}_2]\right\|_F \tag{9.6.67}$$


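Since the principles of TLS are beyond the scope of this book, we simply give the procedure 
to obtain the TLS solution of $\boldsymbol{\Psi}$ and refer the interested reader to Golub and Van Loan (1996). 

First, form a matrix made up of the staggered signal subspace matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ 
placed side by side,† and perform an SVD 

$$[\mathbf{U}_1\;\; \mathbf{U}_2] = \tilde{\mathbf{L}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{U}}^H \tag{9.6.68}$$

We then operate on the 2P x 2P matrix $\tilde{\mathbf{U}}$ of right singular vectors. This matrix is partitioned 
into P x P quadrants 

$$\tilde{\mathbf{U}} = \begin{bmatrix} \tilde{\mathbf{U}}_{11} & \tilde{\mathbf{U}}_{12} \\ \tilde{\mathbf{U}}_{21} & \tilde{\mathbf{U}}_{22} \end{bmatrix} \tag{9.6.69}$$

The TLS solution for the subspace rotation matrix $\boldsymbol{\Psi}$ is then 

$$\boldsymbol{\Psi}_{tls} = -\tilde{\mathbf{U}}_{12}\tilde{\mathbf{U}}_{22}^{-1} \tag{9.6.70}$$

The frequency estimates are then obtained from (9.6.62) and (9.6.63) by using $\boldsymbol{\Psi}_{tls}$ from 
(9.6.70). Although the TLS version of ESPRIT involves slightly more computations, it 
is generally preferred over the LS version because it is based on the formulation in (9.6.66), 
which accounts for errors in both subspace estimates. A statistical 
analysis of the performance of the ESPRIT algorithms is given in Ottersten et al. (1991). 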
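
Although the general TLS theory is left to the reference, the form of (9.6.70) can be motivated in the error-free case. The following is a sketch under the assumptions that $\mathbf{U}_2 = \mathbf{U}_1\boldsymbol{\Psi}$ holds exactly and that $\mathbf{U}_1$ has full column rank P; the invertible P x P matrix $\mathbf{B}$ is a symbol introduced here for illustration.

```latex
\begin{aligned}
&\mathbf{U}_2 = \mathbf{U}_1\boldsymbol{\Psi}
 \;\Longrightarrow\;
 [\mathbf{U}_1\;\; \mathbf{U}_2]\begin{bmatrix} -\boldsymbol{\Psi} \\ \mathbf{I} \end{bmatrix} = \mathbf{0},
 \qquad \operatorname{rank}[\mathbf{U}_1\;\; \mathbf{U}_2] = P \\[4pt]
&\text{so the $P$ right singular vectors with zero singular values span this null space:}\quad
 \begin{bmatrix} \tilde{\mathbf{U}}_{12} \\ \tilde{\mathbf{U}}_{22} \end{bmatrix}
 = \begin{bmatrix} -\boldsymbol{\Psi} \\ \mathbf{I} \end{bmatrix}\mathbf{B} \\[4pt]
&\tilde{\mathbf{U}}_{12} = -\boldsymbol{\Psi}\mathbf{B},\quad \tilde{\mathbf{U}}_{22} = \mathbf{B}
 \;\Longrightarrow\;
 -\tilde{\mathbf{U}}_{12}\tilde{\mathbf{U}}_{22}^{-1} = \boldsymbol{\Psi}
\end{aligned}
```

With estimated subspaces the P smallest singular values are no longer exactly zero, and the same partition then yields the rotation that is consistent with the smallest perturbation of both $\mathbf{U}_1$ and $\mathbf{U}_2$.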


EXAMPLE 9.6.4. In this illustrative example, we demonstrate the use of both the LS and TLS 
versions of the ESPRIT algorithm on a set of complex exponentials in white noise using MATLAB. 
First, generate a signal s(n) of length N = 128 consisting of complex exponential signals at 
normalized frequencies f = 0.1, 0.15, 0.4, and -0.15, all with amplitude α = 1. Each of the 
complex exponentials is generated by exp(j*2*pi*f*[0:(N-1)]'). The overall signal in 
white noise with unit power ($\sigma_w^2 = 1$) is then 


x = s + (randn(N,1)+j*randn(N,1))/sqrt(2); 


We form the data matrix corresponding to (9.6.20) for a time window of length M = 8. 
The least-squares ESPRIT algorithm is then performed as follows: 


[L,S,U] = svd(X); 
Us = U(:,1:P);                             % signal subspace 
U1 = Us(1:(M-1),:); U2 = Us(2:M,:);        % staggered signal subspaces 
Psi = U1\U2;                               % LS solution for Psi 


If we are using the TLS version of ESPRIT, then solve for 


[LL,SS,UU] = svd([U1 U2]); 
UU12 = UU(1:P,(P+1):(2*P)); 
UU22 = UU((P+1):(2*P),(P+1):(2*P)); 
Psi = -UU12/UU22;                          % TLS solution for Psi 


The frequencies are found by computing the phases of the eigenvalues of $\boldsymbol{\Psi}$, that is, 


phi = eig(Psi);                            % eigenvalues of Psi 
fhat = angle(phi)/(2*pi);                  % frequency estimates 


In both cases, we average over 1000 realizations and obtain average estimated frequencies 
very close to the true values f = 0.1, 0.15, 0.4, and -0.15 used to generate the signals. Routines 
for both the LS and TLS versions of ESPRIT are provided in esprit_ls.m and esprit_tls.m. 
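
The fragments of this example can be assembled into one self-contained script. The sketch below is not the esprit_ls.m or esprit_tls.m routine: it omits the noise so that the result is deterministic, and it assumes a data matrix whose rows are the conjugate-transposed time-window vectors, since (9.6.20) is not reproduced here. The names Psi_ls, Psi_tls, fhat_ls, and fhat_tls are introduced for illustration.

```matlab
% Sketch: LS and TLS ESPRIT on noise-free data (illustrative assumptions).
N = 128; M = 8; P = 4;
f = [0.10 0.15 0.40 -0.15];               % normalized frequencies of the example
n = (0:N-1).';
x = sum(exp(1j*2*pi*n*f), 2);             % unit-amplitude exponentials, no noise

K = N - M + 1;                            % number of data records
X = zeros(K, M);
for k = 1:K
    X(k,:) = x(k:k+M-1)';                 % row k is the conjugate transpose of the window
end

[~, ~, U] = svd(X);                       % right singular vectors (M x M)
Us = U(:, 1:P);                           % signal subspace
U1 = Us(1:M-1, :);                        % unstaggered subspace
U2 = Us(2:M,   :);                        % staggered subspace

Psi_ls = U1 \ U2;                         % LS solution for Psi

[~, ~, UU] = svd([U1 U2]);                % 2P x 2P right singular vectors
UU12 = UU(1:P,       P+1:2*P);
UU22 = UU(P+1:2*P,   P+1:2*P);
Psi_tls = -UU12 / UU22;                   % TLS solution for Psi

fhat_ls  = sort(angle(eig(Psi_ls)))  / (2*pi);
fhat_tls = sort(angle(eig(Psi_tls))) / (2*pi);
disp([fhat_ls fhat_tls]);
```

Adding the noise term of the example to x turns both columns into estimates that scatter around the true frequencies from one realization to the next.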


† Note that this matrix $[\mathbf{U}_1\;\; \mathbf{U}_2] \neq \mathbf{U}_s = [\mathbf{U}_1^T\;\; \mathbf{U}_2^T]^T$ from (9.6.55). 


9.7 SUMMARY 


In this chapter, we have examined the modeling process for both pole-zero and harmonic 
signal models. As for all signal modeling problems, the procedure begins with the selection 
of the appropriate model for the signal under consideration. Then the signal model is applied 
by estimating the model parameters from a collection of data samples. However, as we 
have stressed throughout this chapter, nothing is more valuable in the modeling process 
than specific knowledge of the signal and its underlying process in order to assess the 
validity of the model for a particular signal. For this reason, we began the chapter with a 
discussion of a model building procedure, starting with the choice of the appropriate model 
and the estimation of its parameters, and concluding with the validation of the model. 
Clearly, if the model is not well-suited for the signal, the application of the model becomes 
meaningless. 

In the first part of the chapter, we considered the application of the parametric signal 
models that were discussed in Chapter 4. The estimation of all-pole models was presented 
for both direct and lattice structures. Within this context, we used various model order 
selection criteria to determine the order of the all-pole model. However, these criteria are 
not necessarily limited to all-pole models. In addition, the relationship was given between 
the all-pole model and Burg's method of maximum entropy. Next, we considered pole-zero 
modeling. Using a nonlinear least-squares technique, a method was presented for 
estimating the parameters of the pole-zero model. The use of pole-zero models for the 
purposes of spectral estimation along with their application to speech modeling was also 
considered. 

The latter part of the chapter focused on harmonic signal models, that is, modeling 
signals using the sum of complex exponentials. The harmonic modeling problem becomes 
one of estimating the frequencies of the complex exponentials. As a bridge between the 
pole-zero and harmonic models, we discussed the topic of minimum-variance spectral 
estimation. As will be explored in the problems that follow, there are several interesting 
relations between the minimum-variance spectrum and the harmonic models. In addition, 
a relationship between the minimum-variance spectral estimator and the all-pole model 
was established. Then, we discussed some of the more popular harmonic modeling methods. 
Starting with the Pisarenko harmonic decomposition, the first such model, we discussed the 
MUSIC, eigenvector, root-MUSIC, and minimum-norm methods for frequency estimation. 
All of these methods are based on computing a pseudospectrum or rooting a polynomial 
obtained from an estimated correlation matrix. Finally, we gave a brief derivation of the ESPRIT 
algorithm, both in its original LS form and the more commonly used TLS form. 


PROBLEMS 


9.1 Consider the random process x(n) described in Example 9.2.3 that is simulated by exciting the 
system function 

$$H(z) = \frac{1}{1 - 2.7607z^{-1} + 3.8108z^{-2} - 2.6535z^{-3} + 0.9238z^{-4}}$$

using a WGN(0, 1) process. Generate N = 250 samples of the process x(n). 


(a) Write a MATLAB function that implements the modified covariance method to obtain AR(P) 
model coefficients and the modeling error variance as a function of P, using N samples 
of x(n). 
(b) Compute and plot the variance oa FPE(P), AIC(P), MDL(P), and CAT(P) for P = 
TQ yg dd. 


(c) Comment on your results and the usefulness of model selection criteria for the process x(n). 


9.2 Consider the Burg approach of minimizing forward-backward LS error $E_m^{fb}$ in (9.2.33). 

(a) Show that by using (9.2.26) and (9.2.27), $E_m^{fb}$ can be put in the form of (9.2.34). 

(b) By minimizing $E_m^{fb}$ with respect to $k_{m-1}$, show that the expression for the optimum $k_{m-1}^{B}$ 
is given by (9.2.35). 

(c) Show that $|k_{m-1}^{B}| < 1$. 

(d) Show that $|k_{m-1}^{B}| < |k_{m-1}^{IS}| < 1$ where $k_{m-1}^{IS}$ is defined in (9.2.36). 


9.3 Generate an AR(2) process using the system function 

$$H(z) = \frac{1}{1 - 0.9z^{-1} + 0.81z^{-2}}$$

excited by a WGN(0, 1) process. Illustrate numerically that if we use the full-windowing method, 
that is, the matrix X in (9.2.8), then the PACS estimates $\{k_m^{FP}\}_{m=0}^{1}$, $\{k_m^{BP}\}_{m=0}^{1}$, and $\{k_m^{B}\}_{m=0}^{1}$ of 
Section 9.2 are identical and hence can be obtained by using the Levinson-Durbin algorithm. 


9.4 Generate sample sequences of an AR(2) process 

$$x(n) = w(n) - 1.5857x(n-1) - 0.9604x(n-2)$$

where w(n) ~ WGN(0, 1). Choose N = 256 samples for each realization. 

(a) Design a first-order optimum linear predictor, and compute the prediction error $e_1(n)$. Test 
the whiteness of the error sequence $e_1(n)$ using the autocorrelation, PSD, and partial 
correlation methods, discussed in Section 9.1. Show your results as an overlay plot using 20 
realizations. 

(b) Repeat the above part, using second- and third-order linear predictors. 

(c) Comment on your plots. 


9.5 Generate sample functions of the process 

$$x(n) = 0.5w(n) + 0.5w(n-1)$$

where w(n) ~ WGN(0, 1). Choose N = 256 samples for each realization. 

(a) Test the whiteness of x(n) and show your results, using overlay plots based on 10 realizations. 

(b) Process x(n) through the AR(1) filter 

$$H(z) = \frac{1}{1 + 0.95z^{-1}}$$

to obtain y(n). Test the whiteness of y(n) and show your results, using overlay plots based 
on 10 realizations. 


9.6 The process x(n) contains a complex exponential in white noise, that is, 

$$x(n) = A e^{j(\omega_0 n + \theta)} + w(n)$$

where A is a real positive constant, θ is a random variable uniformly distributed over [0, 2π], $\omega_0$ 
is a constant between 0 and π, and w(n) ~ WGN(0, $\sigma_w^2$). The purpose of this problem is to 
analytically obtain a maximum entropy method (MEM) estimate by fitting an AR(P) model 
and then evaluating the $\{a_k\}_0^P$ model coefficients. 

(a) Show that the (P + 1) x (P + 1) autocorrelation matrix of x(n) is given by 

$$\mathbf{R}_x = A^2\mathbf{e}\mathbf{e}^H + \sigma_w^2\mathbf{I}$$

where $\mathbf{e} = [1\;\; e^{j\omega_0}\;\; \cdots\;\; e^{jP\omega_0}]^T$. 

(b) By solving autocorrelation normal equations, show that 

$$\mathbf{a}_P = [1\;\; a_1\;\; \cdots\;\; a_P]^T = \left(1 + \frac{A^2}{\sigma_w^2 + A^2P}\right)\left\{[1\;\; 0\;\; \cdots\;\; 0]^T - \frac{A^2}{\sigma_w^2 + (P+1)A^2}\,\mathbf{e}\right\}$$

(c) Show that the MEM estimate based on the above coefficients is given by 

$$\hat{R}_x(e^{j\omega}) = \frac{\sigma_w^2\left[1 - \dfrac{A^2}{\sigma_w^2 + (P+1)A^2}\right]}{\left|1 - \dfrac{A^2}{\sigma_w^2 + (P+1)A^2}\,W_R\!\left(e^{j(\omega - \omega_0)}\right)\right|^2}$$

where $W_R(e^{j\omega})$ is the DTFT of the (P + 1) length rectangular window. 
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9.7 


An AR(2) process y(n) is observed in noise v(n) to obtain x(n), that is, 

$$x(n) = y(n) + v(n) \qquad v(n) \sim \mathrm{WGN}(0, \sigma_v^2)$$

where v(n) is uncorrelated with y(n) and 

$$y(n) = 1.27y(n-1) - 0.81y(n-2) + w(n) \qquad w(n) \sim \mathrm{WGN}(0, 1)$$

(a) Determine and plot the true power spectrum $R_x(e^{j\omega})$. 

(b) Generate 10 realizations of x(n), each with N = 256 samples. Using the LS approach with 
forward-backward linear predictor, estimate the power spectrum for P = 2 and $\sigma_v^2 = 1$. 
Obtain an overlay plot of this estimate, and compare it with the true spectrum. 

(c) Repeat part (b), using $\sigma_v^2 = 10$. Comment on the effect of increasing noise variance on 
spectrum estimates. 

(d) Since the noise variance $\sigma_v^2$ affects only $r_x(0)$, investigate the effect of subtracting a small 
amount from $r_x(0)$ on the spectrum estimates in part (c). 


9.8 Let x(n) be a random process whose correlation is estimated. The values for the first five lags 
are $r_x(0) = 1$, $r_x(1) = 0.7$, $r_x(2) = 0.5$, $r_x(3) = 0.3$, and $r_x(4) = 0$. 

(a) Determine and plot the Blackman-Tukey power spectrum estimate. 

(b) Assume that x(n) is modeled by an AP(2) model. Determine and plot its spectrum estimate. 

(c) Now repeat (b) assuming that AP(4) is an appropriate model for x(n). Determine and plot 
the spectrum estimate. 


9.9 The narrowband process x(n) is generated using the AP(4) model 

$$H(z) = \frac{1}{1 + 0.98z^{-1} + 1.92z^{-2} + 0.94z^{-3} + 0.92z^{-4}}$$

driven by WGN(0, 0.001). Generate 10 realizations, each with N = 256 samples, of this process. 

(a) Determine and plot the true power spectrum $R_x(e^{j\omega})$. 

(b) Using the LS approach with forward linear predictor, estimate the power spectrum for 
P = 4. Obtain an overlay plot of this estimate, and compare it with the true spectrum. 

(c) Repeat part (b) with P = 8 and 12. Provide a qualitative description of your results with 
respect to model order size. 

(d) Using the LS approach with forward-backward linear predictor, estimate the power spectrum 
for P = 4. Obtain an overlay plot of this estimate. Compare it with the plot in part (b). 


9.10 Consider the following PZ(4, 2) model 


fz + 


~ 1404124 
H(z) 


driven by WGN(0, 1) to obtain a broadband ARMA process x(n). Generate 10 realizations, 
each with N = 256 samples, of this process. 

(a) Determine and plot the true power spectrum $R_x(e^{j\omega})$. 

(b) Using the LS approach with forward-backward linear predictor, estimate the power spectrum 
for P = 12. Obtain an overlay plot of this estimate, and compare it with the true spectrum. 

(c) Using the nonlinear LS pole-zero modeling algorithm of Section 9.3.3, estimate the power 
spectrum for P = 4 and Q = 2. Obtain an overlay plot of this estimate, and compare it 
with the plot in part (b). 


9.11 A random process x(n) is given by 

$$x(n) = \cos\left(\frac{\pi n}{3} + \theta_1\right) + w(n) - w(n-2) + \cos\left(\frac{2\pi n}{3} + \theta_2\right)$$

where w(n) ~ WGN(0, 1) and $\theta_1$ and $\theta_2$ are IID random variables uniformly distributed 
between 0 and 2π. Generate a sample sequence with N = 256 samples. 

(a) Determine and plot the true spectrum $R_x(e^{j\omega})$. 

(b) Using the LS approach with forward-backward linear predictor, estimate the power spectrum 
for P = 10, 20, and 40 from the generated sample sequence. Compare it with the true 
spectrum. 

(c) Using the nonlinear LS pole-zero modeling algorithm of Section 9.3.3, estimate the power 
spectrum for P = 4 and Q = 2. Compare it with the true spectrum and with the plot in 
part (b). 


9.12 Show that, for large values of N, the modeling error variance estimate given by Equation (9.2.38) 
can be approximated by the estimate given by Equation (9.2.39). 


9.13 This problem investigates the effect of correlation aliasing observed in LS estimation of model 
parameters when the AP model is excited by discrete spectra. Consider an AP(1) model with 
pole at z = α excited by a periodic sequence of period N. Let x(n) be the output sequence. 

(a) Show that the correlation at lag 1 satisfies 

$$r_x(1) = \frac{\alpha^{N-1} + \alpha}{\alpha^{N} + 1}\,r_x(0) \tag{P.1}$$

(b) Using the LS approach, determine the estimate $\hat{\alpha}$ as a function of α and N. Compute $\hat{\alpha}$ for 
α = 0.9 and N = 10. 

(c) Generate x(n), using α = 0.95 and the periodic impulse train with N = 10. Compute and 
plot the correlation sequence $r_x(l)$, 0 ≤ l ≤ N - 1, of x(n). Compare your plot with the 
AP(1) model correlation for α = 0.95. Comment on your observations and discuss why 
they explain the discrepancy between α and $\hat{\alpha}$. 

(d) Repeat part (c) for N = 100 and 1000. Show analytically and numerically that $\hat{\alpha} \to \alpha$ as 
N → ∞. 


9.14 In this problem, we investigate the equation error method of Section 9.3.1. Consider the PZ(2, 
2) model 

$$x(n) = 0.3x(n-1) + 0.4x(n-2) + w(n) + 0.25w(n-2)$$

Generate N = 200 samples of x(n), using w(n) ~ WGN(0, √10). Record values of both x(n) 
and w(n). 

(a) Using the residual windowing method, that is, $N_i = \max(P, Q)$ and $N_f = N - 1$, compute 
the estimates of the above model parameters. 

(b) Compute the input variance estimate $\hat{\sigma}_w^2$ from your estimated values in part (a). Compare 
it with the actual value $\sigma_w^2$ and with (9.3.12). 


9.15 Consider the following PZ(4, 2) model 

$$x(n) = 1.8766x(n-1) - 2.6192x(n-2) + 1.6936x(n-3) - 0.8145x(n-4) + w(n) + 0.05w(n-1) - 0.855w(n-2)$$

excited by w(n) ~ WGN(0, √10). Generate 300 samples of x(n). 

(a) Using the nonlinear LS pole-zero modeling algorithm of Section 9.3.3, estimate the 
parameters of the above model from the x(n) data segment. 

(b) Assuming the AP(10) model for the data segment, estimate its parameters by using the LS 
approach described in Section 9.2. 

(c) Generate a plot similar to Figure 9.13 by computing spectra corresponding to the true PZ(4, 
2), estimated PZ(4, 2), and estimated AP(10) models. Compare and comment on your results. 


9.16 Using matrix notation, show that AZ power spectrum estimation is equivalent to the 
Blackman-Tukey method discussed in Chapter 5. 


9.17 Consider the PZ(4, 2) model given in Problem 9.15. Generate 300 samples of x(n). 

(a) Fit an AP(5) model to the data and plot the resulting spectrum. 

(b) Fit an AP(10) model to the data and plot the resulting spectrum. 

(c) Fit an AP(50) model to the data and plot the resulting spectrum. 

(d) Compare your plots with the true spectrum, and discuss the effect of model mismatch on 
the quality of the spectrum. 
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9.18 


Use the supplied (about 50-ms) segment of a speech signal sampled at 8192 samples per second. 

(a) Compute a periodogram of the speech signal (see Chapter 5). 

(b) Using data windowing, fit an AP(16) model to the speech data and compute the spectrum. 

(c) Using the residual windowing, fit a PZ(12, 6) model to the speech data and compute the 
spectrum. 

(d) Plot the above three spectra on one graph, and comment on the performance of each method. 


9.19 One practical approach to spectrum estimation discussed in Section 9.4 is the prewhitening and 
postcoloring method. 

(a) Develop a MATLAB function to implement this method. Use the forward/backward LS 
method to determine AP(P) parameters and the Welch method for nonparametric spectrum 
estimation. 

(b) Verify your function on the short segment of the speech signal from Problem 9.18. 

(c) Compare your results with those obtained in Problem 9.18. 


9.20 Consider a white noise process with variance $\sigma_w^2$. Find its minimum-variance power spectral 
estimate. 


9.21 Find the minimum-variance spectrum of a first-order all-pole model, that is, 

$$x(n) = -a_1x(n-1) + w(n)$$


9.22 The filter coefficient vector for the minimum-variance spectrum estimator is given in (9.5.10). 
Using Lagrange multipliers, discussed in Appendix B, solve this constrained optimization to 
find this weight vector. 


9.23 Using the relationship between the minimum-variance and the all-pole model spectrum 
estimators in (9.5.22), generate a recursive relationship for the minimum-variance spectrum estimators 
of increasing window length. In other words, write the minimum-variance spectrum estimator for 
one window length in terms of the minimum-variance spectrum estimator for the preceding, shorter 
window length and the all-pole model spectrum estimator in (9.5.20). 


9.24 In Pisarenko harmonic decomposition, discussed in Section 9.6.2, we determine the frequencies 
of the complex exponentials in white noise through the use of the pseudospectrum. The word 
pseudospectrum was used because its value does not correspond to an estimated power. Find a 
set of linear equations that can be solved to find the powers of the complex exponentials. Hint: 
Use the relationship of eigenvalues and eigenvectors $\mathbf{R}_x\mathbf{q}_m = \lambda_m\mathbf{q}_m$ for m = 1, 2, ..., M. 


9.25 For the MUSIC algorithm, we showed a means of using the MUSIC pseudospectrum to derive a 
polynomial that could be rooted to obtain frequency estimates, which is known as root-MUSIC. 
Find a similar rooting method for the minimum-norm frequency estimation procedure. 


9.26 The Pisarenko harmonic decomposition, MUSIC, and minimum-norm algorithms yield 
frequency estimates by computing a pseudospectrum using the Fourier transforms of the 
eigenvectors. However, these pseudospectra do not actually estimate a power. Derive the 
minimum-variance spectral estimator in terms of the Fourier transforms of the eigenvectors and the 
associated eigenvalues. Relate this result to the MUSIC and eigenvector method pseudospectra. 


9.27 Show that the pseudospectrum for the MUSIC algorithm is equivalent to the minimum-variance 
spectrum in the case of an infinite signal-to-noise ratio. 


9.28 Find a relationship between the minimum-norm pseudospectrum and the all-pole model 
spectrum in the case of an infinite signal-to-noise ratio. 


9.29 In (9.5.22), we derived a relationship between the minimum-variance spectral estimator and 
spectrum estimators derived from all-pole models of orders 1 to M. Find a similar relationship 
between the pseudospectra of the MUSIC and minimum-norm algorithms that shows that the 
MUSIC pseudospectrum is a weighted average of minimum-norm pseudospectra. 


CHAPTER 10 


Adaptive Filters 


In Chapter 1, we discussed different practical applications that demonstrated the need for 
adaptive filters, pointed out the key aspects of the underlying signal operating environment 
(SOE), and illustrated the key features and types of adaptive filters. The defining charac- 
teristic of an adaptive filter is its ability to operate satisfactorily, according to a criterion of 
performance acceptable to the user, in an unknown and possibly time-varying environment 
without the intervention of the designer. In Chapter 6, we developed the theory of optimum 
filters under the assumption that the filter designer has complete knowledge of the statistical 
properties (usually second-order moments) of the SOE. However, in real-world applications 
such information is seldom available, and the most practical solution is to use an adaptive 
filter. Adaptive filters can improve their performance, during normal operation, by learning 
the statistical characteristics through processing current signal observations. 

In this chapter, we develop a mathematical framework for the design and performance 
evaluation of adaptive filters, both theoretically and by simulation. The goal of an adaptive 
filter is to “find and track” the optimum filter corresponding to the same signal operating 
environment with complete knowledge of the required statistics. In this context, optimum 
filters provide both guidance for the development of adaptive algorithms and a yardstick 
for evaluating the theoretical performance of adaptive filters. We start in Section 10.1 with 
a discussion of a few typical application problems that can be effectively solved by using 
an adaptive filter. The performance of adaptive filters is evaluated using the concepts of 
stability, speed of adaptation, quality of adaptation, and tracking capabilities. These issues 
and the key features of an adaptive filter are discussed in Section 10.2. Since most adaptive 
algorithms originate from deterministic optimization methods, in Section 10.3 we introduce 
the family of steepest-descent algorithms and study their properties. Sections 10.4 and 10.5 
provide a detailed discussion of the derivation, properties, and applications of the two most 
important adaptive filtering algorithms: the least mean square (LMS) and the recursive 
least-squares (RLS) algorithms. The conventional RLS algorithm, introduced in Section 
10.5, can be used for either array processing (multiple-sensor or general input data vector) 
applications or FIR filtering (single-sensor or shift-invariant input data vector) applications. 
Section 10.6 deals with different implementations of the RLS algorithm for array processing 
applications, whereas Section 10.7 provides fast implementations of the RLS algorithm for 
the FIR filtering case. The development of the latter algorithms is a result of the shift 
invariance of the data stored in the memory of the FIR filter. Finally, in Section 10.8 we 
provide a concise introduction to the tracking properties of the LMS and the RLS algorithms. 
10.1 TYPICAL APPLICATIONS OF ADAPTIVE FILTERS 


As we have already seen in Chapter 1, many practical applications cannot be successfully 
solved by using fixed digital filters because either we do not have sufficient information to 
design a digital filter with fixed coefficients or the design criteria change during the normal 
operation of the filter. Most of these applications can be successfully solved by using a 
special type of “smart” filters known collectively as adaptive filters. The distinguishing 
feature of adaptive filters is that they can modify their response to improve their performance 
during operation without any intervention from the user. 

The best way to introduce adaptive filters is with some applications for which they are 
well suited. These and other applications are discussed in greater detail in the sequel as we 
develop the necessary background and tools. 


10.1.1 Echo Cancelation in Communications 


An echo is the delayed and distorted version of an original signal that returns to its source. 
In some applications (radar, sonar, or ultrasound), the echo is the wanted signal; however, in 
communication applications, the echo is an unwanted signal that must be eliminated. There 
are two types of echoes in communication systems: (1) electrical or line echoes, which 
are generated electrically due to impedance mismatches at points along the transmission 
medium, and (2) acoustic echoes, which result from the reflection of sound waves and 
acoustic coupling between a microphone and a loudspeaker. 

Here we focus on electrical echoes in voice communications; electrical echoes in data 
communications are discussed in Section 10.4.4, and acoustic echoes in teleconferencing 
and hands-free telephony were discussed in Section 1.4.1. 

Electrical echoes are observed on long-distance telephone circuits. A simplified form 
of such a circuit, which is sufficient for the present discussion, is shown in Figure 10.1. 
The local links from the customer to the telephone office consist of bidirectional two-wire 
connections, whereas the connection between the telephone offices is a four-wire carrier 
facility that may include a satellite link. The conversion between two-wire and four-wire 
links is done by special devices known as hybrids. An ideal hybrid should pass (1) the 
incoming signal to the two-wire output without any leakage into its output port and (2) the 
signal from the two-wire circuit to its output port without reflecting any energy back to the 
two-wire line (Sondhi and Berkley 1980). In practice, due to impedance mismatches, the 
hybrids do not operate perfectly. As a result, some energy on the incoming branch of the 
four-wire circuit leaks into the outgoing branch and returns to the source as an echo (see 
Figure 10.1). This echo, which is usually 11 dB down from the original signal, makes it 
difficult to carry on a conversation if the round-trip delay is larger than 40 ms. Satellite 
links, as a consequence of high altitude, involve round-trip delays of 500 to 600 ms. 


FIGURE 10.1 
Echo generation in a long-distance telephone network. [Diagram labels: talkers A and B, two-wire connections, four-wire connection, echo of A's speech.] 


The first devices used by telephone companies to control voice echoes were echo 
suppressors. Basically, an echo suppressor is a voice-activated switch that attempts to 
impose an open circuit on the return path from listener to talker when the listener is silent 
(see Figure 10.2). The main problems with these devices are speech clipping during double- 
talking and the inability to effectively deal with round-trip delays longer than 100 ms 
(Weinstein 1977). 
FIGURE 10.2 
Principle of echo suppression. 


The problems associated with echo suppressors could be largely avoided if we could 
estimate the transmission path from point C to point D (see Figure 10.3), which is known 
as the echo path. If we knew the echo path, we could design a filter that produced a copy 
or replica of the echo signal when driven by the signal at point C. Subtraction of the echo 
replica from the signal at point D will eliminate the echo without distorting the speech of 
the second talker that may be present at point D. The resulting device, shown in Figure 
10.3, is known as an echo canceler. 
FIGURE 10.3 
Principle of echo cancelation. 


In practice, the channel characteristics are generally not known. For dial-up telephone 
lines, the channel differs from call to call, and the characteristics of radio and microwave 
channels (phase perturbations, fading, etc.) change significantly with time. Therefore, we 
cannot design and use a fixed echo canceler with satisfactory performance for all possible 
connections. There are two possible ways around this problem: 
1. Design a compromise fixed echo canceler based on some “average” echo path, assuming 
that we have sufficient information about the connections to be seen by the canceler. 

2. Design an adaptive echo canceler that can “learn” the echo path when it is first turned on 
and afterward “tracks” its variations without any intervention from the designer. Since 
an adaptive canceler matches the echo path for any given connection, it performs better 
than a compromise fixed canceler. 


We stress that the main task of the canceler is to estimate the echo signal with sufficient 
accuracy; the estimation of the echo path is simply the means of achieving this goal. The 
performance of the canceler is measured by the attenuation, in decibels, of the echo, which 
is known as echo return loss enhancement. The adaptive echo canceler achieves this goal 
by modifying its response, using the residual echo signal in an as yet unspecified way. 
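
As a rough numerical illustration of echo return loss enhancement, the following Python sketch subtracts an echo replica from the return signal and reports the resulting attenuation in decibels. The echo path `h_true`, the slightly mismatched estimate `h_est`, and the helper names are hypothetical values chosen for illustration; no adaptive algorithm is involved, and the second talker is assumed to be silent.

```python
import math
import random

def fir(h, x):
    """Filter the sequence x with the FIR impulse response h (zero initial state)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def power(x):
    """Average power of a sequence."""
    return sum(v * v for v in x) / len(x)

def erle_db(echo, residual):
    """Echo return loss enhancement: echo power over residual echo power, in dB."""
    return 10.0 * math.log10(power(echo) / power(residual))

random.seed(0)
far_end = [random.gauss(0.0, 1.0) for _ in range(2000)]   # signal at point C
h_true = [0.0, 0.3, -0.1, 0.05]    # assumed echo path (hypothetical)
h_est = [0.0, 0.29, -0.11, 0.05]   # imperfect estimate of the echo path

echo = fir(h_true, far_end)        # echo observed at point D
replica = fir(h_est, far_end)      # echo replica produced by the canceler
residual = [e - r for e, r in zip(echo, replica)]
print(round(erle_db(echo, residual), 1))
```

The closer `h_est` is to `h_true`, the smaller the residual echo and the larger the enhancement figure, which is why the accuracy of the echo estimate, rather than of the path itself, is what ultimately matters.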

Adaptive echo cancelers are widely used in voice telecommunications, and the international 
standards organization CCITT has issued a set of recommendations (CCITT G.165) that 
outlines the basic requirements for echo cancelers. More details can be found in 
Weinstein (1977) and Murano et al. (1990). 


10.1.2 Equalization of Data Communication Channels 


Channel equalization, which is probably the most widely employed technique in practical 
data transmission systems, was first introduced in Section 1.4.1. In Section 6.8 we discussed 
the design of symbol rate zero-forcing and optimum MSE equalizers. As we recall, every 
pulse propagating through the channel suffers a certain amount of time dispersion because 
the frequency response of the channel deviates from the ideal one of constant magnitude 
and linear phase. Some typical sources of dispersion for practical communication channels 
are summarized in Table 10.1. As a result, the tails of adjacent pulses interfere with the 
measurement of the current pulse (intersymbol interference) and can lead to an incorrect 
decision. 


TABLE 10.1 
Summary of causes of dispersion in various communications systems. 

Transmission system    Causes of dispersion 

Cable TV               Transmitter filtering; coaxial-cable dispersion; cable amplifiers; 
                       reflections from impedance mismatches; bandpass filters 

Microwave radio        Transmitter filtering; reflections from impedance mismatches; 
                       multipath propagation; scattering; input bandpass filtering 

Voiceband modems       Digital-to-analog image suppression; channel filtering; twisted-pair 
                       transmission line; multiplexing and demultiplexing filters; hybrids; 
                       antialias lowpass filters 

Troposcatter radio     Transmitter filtering; atmospheric dispersion; scattering at interface 
                       between troposphere and stratosphere; receiver bandpass filtering; 
                       input amplifiers 

Source: From Treichler et al. 1996. 


Since the channel can be modeled as a linear system, assuming that the receiver and the 
transmitter do not include any nonlinear operations, we can compensate for its distortion by 
using a linear equalizer. The goal of the equalizer is to restore the received pulse, as closely 
as possible, to its original shape. The equalizer transforms the channel to a near-ideal one if 
its response resembles the inverse of the channel. Since the channel is unknown and possibly 
time-varying, there are two ways to approach the problem: (1) Design a compromise fixed 
equalizer to obtain satisfactory performance over a broad range of channels, or (2) design 
an equalizer that can learn the inverse of the particular channel and then track its variation 
in real time. 


The characteristics of the equalizer are adjusted by some algorithm that attempts to 
attain the best possible performance. The most appropriate criterion of performance for 
data transmission systems is the probability of error. However, it cannot be used for two 
reasons: (1) the “correct” symbol is unknown to the receiver (otherwise there would be 
no reason to communicate), and (2) the number of decisions needed to estimate the low 
probabilities of error is extremely large. Thus, practical equalizers assess their performance 
by using some function of the difference between the correct symbol and their output. The 
operation of practical equalizers involves three modes of operation, dependent on how we 
substitute for the unavailable correct symbol sequence. 


Training mode: A known training sequence is transmitted, and the equalizer attempts 
to improve its performance by comparing its output to a synchronized replica of the 
training sequence stored at the receiver. Usually this mode is used when the equalizer 
starts a transmission session. 

Decision-directed mode: At the end of the training session, when the equalizer starts 
making reliable decisions, we can replace the training sequence with the equalizer’s 
own decisions. 

“Blind” or self-recovering mode: There are several applications in which the use of a 
training sequence is not desired or feasible. This may occur in multipoint networks 
for computer communications or in wideband digital systems over coaxial facilities 
during rerouting (Godard 1980; Sato 1975). Also when the decision-directed mode 
of a microwave channel equalizer fails, after deep fades, we do not have a reverse 
channel to call for retraining (Foschini 1985). In such cases, where the equalizer 
should be able to learn or recover the characteristics of the channel without the 
benefit of a training sequence, we say that the equalizer operates in blind or self- 
recovering mode. 
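
The difference between the training and decision-directed modes lies only in where the reference symbol comes from. The Python sketch below makes this explicit for binary symbols. The channel taps, the step size `mu`, and the simple LMS-style coefficient update are assumptions made for illustration (the LMS algorithm itself is developed in Section 10.4); the blind mode is not covered.

```python
import random

def slicer(v):
    """Decide on the nearest binary symbol (+1 or -1)."""
    return 1.0 if v >= 0.0 else -1.0

def run_equalizer(received, training, num_taps=5, mu=0.05):
    """Adapt a linear equalizer: training symbols first, then its own decisions."""
    c = [0.0] * num_taps
    buf = [0.0] * num_taps                  # most recent received samples
    decisions, sq_errors = [], []
    for n, r in enumerate(received):
        buf = [r] + buf[:-1]
        y = sum(ck * xk for ck, xk in zip(c, buf))
        if n < len(training):
            ref = training[n]               # training mode: stored replica
        else:
            ref = slicer(y)                 # decision-directed mode
        e = ref - y
        # LMS-style update, assumed here only for illustration
        c = [ck + mu * e * xk for ck, xk in zip(c, buf)]
        decisions.append(slicer(y))
        sq_errors.append(e * e)
    return decisions, sq_errors

random.seed(1)
symbols = [random.choice((-1.0, 1.0)) for _ in range(2000)]
channel = [1.0, 0.4]                        # hypothetical dispersive channel
received = [sum(channel[k] * symbols[n - k]
                for k in range(len(channel)) if n - k >= 0)
            for n in range(len(symbols))]

decisions, sq_errors = run_equalizer(received, training=symbols[:500])
symbol_errors = sum(d != s for d, s in zip(decisions[500:], symbols[500:]))
mse_dd = sum(sq_errors[500:]) / len(sq_errors[500:])
print(symbol_errors, mse_dd)
```

Once the decisions are reliable, the error formed from the equalizer's own decisions coincides with the error that the true symbols would have produced, so adaptation can continue without a training sequence.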


Adaptive equalization is a mature technology that has had the greatest impact on digital 
communications systems, including voiceband, microwave and troposcatter radio, and cable 
TV modems (Qureshi 1985; Lee and Messerschmitt 1994; Gitlin et al. 1992; Bingham 1988; 
Treichler et al. 1996, 1998). 


10.1.3 Linear Predictive Coding 


The efficient storage and transmission of analog signals using digital systems requires the 
minimization of the number of bits necessary to represent the signal while maintaining the 
quality to an acceptable level according to a certain criterion of performance. The conversion 
of an analog (continuous-time, continuous-amplitude) signal to a digital (discrete-time, 
discrete-amplitude) signal involves two processes: sampling and quantization. Sampling 
converts a continuous-time signal to a discrete-time signal by measuring its amplitude 
at equidistant intervals of time. Quantization involves the representation of the measured 
continuous amplitude by using a finite number of symbols. Therefore, a small range of 
amplitudes will use the same symbol (see Figure 10.4). A code word is assigned to each 
symbol by the coder. When the digital representation is used for digital signal processing, the 
quantization levels and the corresponding code words are uniformly distributed. However, 
for coding applications, levels may be nonuniformly distributed to match the distribution 
of the signal amplitudes. 

FIGURE 10.4 
Partitioning of the range of a 3-bit (eight-level) uniform quantizer. 

For all practical purposes, the range of a quantizer is equal to $R_Q = \Delta \cdot 2^B$, where 
$\Delta$ is the quantization step size and $B$ is the number of bits, and should cover the dynamic 
range of the signal. The difference between the unquantized sample $x(n)$ and the quantized 
sample $\hat{x}(n)$, that is, 

$$e(n) = \hat{x}(n) - x(n) \qquad (10.1.1)$$

is known as the quantization error and is always in the range $-\Delta/2 < e(n) < \Delta/2$. If we 
define the signal-to-noise ratio by 

$$\mathrm{SNR} \triangleq \frac{E\{x^2(n)\}}{E\{e^2(n)\}} \qquad (10.1.2)$$

it can be shown (Rabiner and Schafer 1978; Jayant and Noll 1984) that 

$$\mathrm{SNR}(\mathrm{dB}) \approx 6B \qquad (10.1.3)$$


which states that each added binary digit increases the SNR by 6 dB. 
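
The 6-dB-per-bit rule can be checked numerically. The Python sketch below quantizes a signal that is uniformly distributed over the full quantizer range and measures the SNR for several word lengths. The midrise quantizer, the test signal, and the function names are assumptions chosen for illustration; other signal distributions shift the SNR by a constant but keep the slope of about 6 dB per bit.

```python
import math
import random

def quantize(x, num_bits, full_range):
    """Uniform midrise quantizer covering [-full_range/2, full_range/2)."""
    step = full_range / 2 ** num_bits             # step size = range / 2^B
    index = math.floor(x / step)
    # Clamp to the available levels in case x sits exactly on the upper edge.
    index = max(-2 ** (num_bits - 1), min(2 ** (num_bits - 1) - 1, index))
    return (index + 0.5) * step

def snr_db(num_bits, samples, full_range=2.0):
    """Measured signal-to-quantization-noise ratio in dB."""
    signal = sum(x * x for x in samples)
    noise = sum((quantize(x, num_bits, full_range) - x) ** 2 for x in samples)
    return 10.0 * math.log10(signal / noise)

random.seed(0)
samples = [random.uniform(-1.0, 1.0) for _ in range(20000)]
for bits in (4, 6, 8):
    print(bits, round(snr_db(bits, samples), 1))
```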

For a fixed number of bits, decreasing the dynamic range of the signal (and therefore the 
range of the quantizer) decreases the required quantization step and therefore the average 
quantization error power. Therefore, we can increase the SNR by reducing the dynamic 
range, or equivalently the variance of the signal. If the signal samples are significantly 
correlated, the variance of the difference between adjacent samples is smaller than the 
variance of the original signal. Thus, we can improve the SNR by quantizing this difference 
instead of the original signal. 

The differential quantization concept is exploited by the linear predictive coding (LPC) 
system illustrated in Figure 10.5. The quantized signal is the difference 

$$d(n) = x(n) - \tilde{x}(n) \qquad (10.1.4)$$

where $\tilde{x}(n)$ is an estimate or prediction of the signal $x(n)$ obtained by the predictor using 
a quantized version 

$$\hat{x}(n) = \tilde{x}(n) + \hat{d}(n) \qquad (10.1.5)$$

of the original signal (see Figure 10.5). If the quantization error of the difference signal is 

$$e_d(n) = \hat{d}(n) - d(n) \qquad (10.1.6)$$

we obtain 

$$\hat{x}(n) = x(n) + e_d(n) \qquad (10.1.7)$$

using (10.1.4) and (10.1.5). The significance of (10.1.7) is that the quantization error of the 
original signal is equal to the quantization error of the difference signal, independently of 
the properties of the predictor. Note that if $c'(n) = c(n)$, that is, there are no transmission 
or storage errors, then the signal reconstructed by the decoder is $\hat{x}'(n) = \hat{x}(n)$. If the 
prediction is good, the dynamic range of $d(n)$ should be smaller than the dynamic range 
of $x(n)$, resulting in a smaller quantization noise for the same number of bits or the same 
quantization noise with a smaller number of bits. The performance of the LPC system 
depends on the accuracy of the predictor. In most practical applications, we use a linear 
predictor that forms an estimate (prediction) $\tilde{x}(n)$ of the present sample $x(n)$ as a linear 
combination of the $M$ past samples, that is, 

$$\tilde{x}(n) = \sum_{k=1}^{M} a_k \hat{x}(n-k) \qquad (10.1.8)$$

FIGURE 10.5 
Block diagram of a linear predictive coding system: (a) coder 
and (b) decoder. 


The coefficients $\{a_k\}$ of the linear predictor are determined by exploiting the correlation 
between adjacent samples of the input signal with the objective to make the prediction error 
as small as possible. Since the statistical properties of the signal x(n) are unknown and 
change with time, we cannot design an optimum fixed predictor. The established practi- 
cal solution uses an adaptive linear predictor that automatically adjusts its coefficients to 
compute a “good” prediction at each time instant. A detailed discussion of adaptive linear 
prediction and its application to audio, speech, and video signal coding is provided in Jayant 
and Noll (1984). 
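
A minimal Python sketch of the coder and decoder loop may help to make the relations (10.1.4) through (10.1.7) concrete. It assumes a fixed first-order predictor with a hypothetical coefficient `a = 0.9` and a uniform quantizer with step `step`; an actual LPC system would adapt the predictor coefficients. In the code, `x_tilde` denotes the prediction, `dq` the quantized difference, and `xq` the quantized reconstruction of the signal.

```python
import math

def quantize(v, step):
    """Uniform midtread quantizer with step size `step` (no overload handling)."""
    return step * round(v / step)

def lpc_encode(x, a=0.9, step=0.05):
    """First-order predictive coder; returns quantized differences and reconstruction."""
    x_hat_prev = 0.0
    d_hat, x_hat = [], []
    for xn in x:
        x_tilde = a * x_hat_prev        # prediction from the past quantized sample
        d = xn - x_tilde                # difference signal, as in (10.1.4)
        dq = quantize(d, step)          # quantized difference
        xq = x_tilde + dq               # quantized signal, as in (10.1.5)
        d_hat.append(dq)
        x_hat.append(xq)
        x_hat_prev = xq
    return d_hat, x_hat

def lpc_decode(d_hat, a=0.9):
    """Decoder: the same predictor driven by the received quantized differences."""
    x_hat_prev = 0.0
    out = []
    for dq in d_hat:
        xq = a * x_hat_prev + dq
        out.append(xq)
        x_hat_prev = xq
    return out

x = [math.sin(2 * math.pi * 0.01 * n) for n in range(300)]
d_hat, x_hat = lpc_encode(x)
decoded = lpc_decode(d_hat)
max_err = max(abs(xq - xn) for xq, xn in zip(x_hat, x))
print(max_err, decoded == x_hat)
```

Because the predictor operates on the quantized signal in both the coder and the decoder, the reconstruction error never exceeds half a quantization step of the difference signal, regardless of how good the prediction is.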


10.1.4 Noise Cancelation 


In Section 1.4.1 we discussed the concept of active noise control using adaptive filters. 
We now provide a theoretical explanation for the general problem of noise canceling using 
multiple sensors. The principle of general noise cancelation is illustrated in Figure 10.6. The 
signal of interest $s(n)$ is corrupted by uncorrelated additive noise $v_1(n)$, and the combined 
signal $s(n) + v_1(n)$ provides what is known as primary input. A second sensor, located 
at a different point, acquires a noise $v_2(n)$ (reference input) that is uncorrelated with the 
signal $s(n)$ but correlated with the noise $v_1(n)$. If we can design a filter that provides a good 
estimate $\hat{y}(n)$ of the noise $v_1(n)$, by exploiting the correlation between $v_1(n)$ and $v_2(n)$, 
then we could recover the desired signal by subtracting $\hat{y}(n) \triangleq \hat{v}_1(n)$ from the primary 
input. 

FIGURE 10.6 
Principle of adaptive noise cancelation using a reference input. 

Let us assume that the signals $s(n)$, $v_1(n)$, and $v_2(n)$ are jointly wide-sense stationary 
with zero mean values. The “clean” signal is given by the error 

$$e(n) = s(n) + [v_1(n) - \hat{y}(n)]$$

where $\hat{y}(n)$ depends on the filter structure and parameters. The MSE is given by 

$$E\{|e(n)|^2\} = E\{|s(n)|^2\} + E\{|v_1(n) - \hat{y}(n)|^2\}$$

because the signals $s(n)$ and $v_1(n) - \hat{y}(n)$ are uncorrelated. Since the signal power is not 
influenced by the filter, if we design a filter that minimizes the total output power $E\{|e(n)|^2\}$, 
then that filter will minimize the output noise power $E\{|v_1(n) - \hat{y}(n)|^2\}$. Therefore, $\hat{y}(n)$ 
will be the MMSE estimate of the noise $v_1(n)$, and the canceler maximizes the output 
signal-to-noise ratio. If we know the second-order moments of the primary and reference 
inputs, we can design an optimum linear canceler using the techniques discussed in Chapter 
6. However, in practice, the design of an optimum canceler is not feasible because the 
required statistical moments are either unknown or time-varying. Once again, a successful 
solution can be obtained by using an adaptive filter that automatically adjusts its parameters 
to obtain the best possible estimate of the interfering noise (Widrow et al. 1975). 
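
The argument that minimizing the total output power also minimizes the residual noise power can be illustrated with a small simulation. The Python sketch below assumes the simplest possible canceler, a single real gain `c` applied to the reference input, and a hypothetical primary noise equal to 0.8 times the reference noise; all signals are synthetic white Gaussian sequences, and the gain is found by a plain search rather than by an adaptive algorithm.

```python
import random

def mean_square(seq):
    """Average power of a sequence."""
    return sum(v * v for v in seq) / len(seq)

random.seed(2)
N = 5000
s = [random.gauss(0.0, 1.0) for _ in range(N)]     # signal of interest
v2 = [random.gauss(0.0, 1.0) for _ in range(N)]    # reference noise
v1 = [0.8 * v for v in v2]                         # primary noise, correlated with v2
primary = [a + b for a, b in zip(s, v1)]           # primary input s(n) + v1(n)

best = None
for i in range(161):
    c = i / 100.0                                  # candidate gain, 0.00 to 1.60
    e = [p - c * r for p, r in zip(primary, v2)]   # canceler output
    noise = [a - c * r for a, r in zip(v1, v2)]    # residual noise v1(n) - y_hat(n)
    row = (mean_square(e), mean_square(noise), c)
    if best is None or row[0] < best[0]:
        best = row

output_power, noise_power, c_best = best
print(c_best, output_power, noise_power)
```

The gain that minimizes the output power lands close to 0.8, where the residual noise power is nearly zero and the output power is essentially the power of the signal of interest alone.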


10.2 PRINCIPLES OF ADAPTIVE FILTERS 


In this section, we discuss a mathematical framework for the analysis and performance eval- 
uation of adaptive algorithms. The goal is to develop design guidelines for the application of 
adaptive algorithms to practical problems. The need for adaptive filters and representative 
applications that can benefit from their use have been discussed in Sections 1.4.1 and 10.1. 


10.2.1 Features of Adaptive Filters 


The applications we have discussed are only a sample from a multitude of practical problems 
that can be successfully solved by using adaptive filters, that is, filters that automatically 
change their characteristics to attain the right response at the right time. Every adaptive 
filtering application involves one or more input signals and a desired response signal that 
may or may not be accessible to the adaptive filter. We collectively refer to these signals as 
the signal operating environment (SOE) of the adaptive filter. Every adaptive filter consists 
of three modules (see Figure 10.7): 


Filtering structure. This module forms the output of the filter using measurements of 
the input signal or signals. The filtering structure is linear if the output is obtained 
as a linear combination of the input measurements; otherwise it is said to be nonlinear. 
For example, the filtering module can be an adjustable finite impulse response 
(FIR) digital filter implemented with a direct or lattice structure or a recursive filter 
implemented using a cascade structure. The structure is fixed by the designer, and 
its parameters are adjusted by the adaptive algorithm. 

Criterion of performance (COP). The output of the adaptive filter and the desired 
response (when available) are processed by the COP module to assess its quality 
with respect to the requirements of the particular application. The choice of the 
criterion is a balanced compromise between what is acceptable to the user of the 
application and what is mathematically tractable; that is, it can be manipulated to 
derive an adaptive algorithm. Most adaptive filters use some average form of the 
square error because it is mathematically tractable and leads to the design of useful 
practical systems. 

Adaptation algorithm. The adaptive algorithm uses the value of the criterion of 
performance, or some function of it, and the measurements of the input and desired 
response (when available) to decide how to modify the parameters of the filter to 
improve its performance. The complexity and the characteristics of the adaptive 
algorithm are functions of the filtering structure and the criterion of performance. 

FIGURE 10.7 
Basic elements of a general adaptive filter. 


The design of any adaptive filter requires some generic a priori information about the 
SOE and a deep understanding of the particular application. This information is needed 
by the designer to choose the criterion of performance and the filtering structure. Clearly, 
unreliable a priori information and/or incorrect assumptions about the SOE can lead to 
serious performance degradations or even unsuccessful adaptive filter applications. The 
conversion of the performance assessment to a successful parameter adjustment strategy, 
that is, the design of an adaptive algorithm, is the most difficult step in the design and 
application of adaptive filters. 

If the characteristics of the SOE are constant, the goal of the adaptive filter is to find 
the parameters that give the best performance and then stop the adjustment. The initial 
period, from the time the filter starts its operation until the time it gets reasonably close to 
its best performance, is known as the acquisition or convergence mode. However, when the 
characteristics of the SOE change with time, the adaptive filter should first find and then 
continuously readjust its parameters to track these changes. In this case, the filter starts with 
an acquisition phase that is followed by a tracking mode. 

A very influential factor in the design of adaptive algorithms is the availability of a 
desired response signal. We have seen that for certain applications, the desired response may 
not be available for use by the adaptive filter. Therefore, the adaptation must be performed 
in one of two ways: 


Supervised adaptation. At each time instant, the adaptive filter knows in advance the 
desired response, computes the error (i.e., the difference between the desired and 
actual response), evaluates the criterion of performance, and uses it to adjust its co- 
efficients. In this case, the structure in Figure 10.7 is simplified to that of Figure 10.8. 

Unsupervised adaptation. When the desired response is unavailable, the adaptive filter 
cannot explicitly form and use the error to improve its behavior. In some applications, 
the input signal has some measurable property (e.g., constant envelope) that is lost 
by the time it reaches the adaptive filter. The adaptive filter adjusts its parameters in 
such a way as to restore the lost property of the input signal. The property restoral 
approach to adaptive filtering was introduced in Treichler et al. (1987). In some 
other applications (e.g., digital communications) the basic task of the adaptive filter 
is to classify each received pulse to one of a finite set of symbols. In this case we 
basically have a problem of unsupervised classification (Fukunaga 1990). 

FIGURE 10.8 
Basic elements of a supervised adaptive filter. 


In this chapter we focus our discussion on supervised adaptive filters, that is, filters 
that have access to a desired response signal; unsupervised adaptive filters, which operate 
without the benefit of a desired response, are discussed in Section 12.3, in the context of 
blind equalization. 


10.2.2 Optimum versus Adaptive Filters 


We have mentioned several times that the theory of stochastic processes provides the math- 
ematical framework for the design and analysis of optimum filters. In Chapter 6, we in- 
troduced filters that are optimum according to the MSE criterion of performance; and in 
Chapter 7, we developed algorithms and structures for their efficient design and imple- 
mentation. However, optimum filters are a theoretical tool and cannot be used in practical 
applications because we do not know the statistical quantities (e.g., second-order moments) 
that are required for their design. Adaptive filters can be thought of as the practical counterpart 
of optimum filters: They try to reach the performance of optimum filters by processing 
measurements of the SOE in real time, which makes up for the lack of a priori statistics. 

For this analysis, we consider the general case of a linear combiner that includes filtering 
and prediction as special cases. However, for convenience we use the terms filters and 
filtering. We remind the reader that, from a mathematical point of view, the key difference 
between a linear combiner and an FIR filter or predictor is the shift invariance (temporal 
ordering) of the input data vector. This difference, which is illustrated in Figure 10.9, also 
has important implications in the implementation of adaptive filters. To this end, suppose 
that the SOE is comprised of $M$ input signals $x_k(n, \zeta)$ and a desired response signal $y(n, \zeta)$, 
which are sample realizations of random sequences.¹ 

¹ For clarity, in this section only, we include the dependence on $\zeta$ to denote random variables. 

Then the estimate of $y(n, \zeta)$ is computed by using the linear combiner 

$$\hat{y}(n, \zeta) = \sum_{k=1}^{M} c_k^*(n)\, x_k(n, \zeta) = \mathbf{c}^H(n)\,\mathbf{x}(n, \zeta) \qquad (10.2.1)$$

where 

$$\mathbf{c}(n) = [c_1(n) \;\; c_2(n) \;\; \cdots \;\; c_M(n)]^T \qquad (10.2.2)$$

is the coefficient vector and 

$$\mathbf{x}(n, \zeta) = [x_1(n, \zeta) \;\; x_2(n, \zeta) \;\; \cdots \;\; x_M(n, \zeta)]^T \qquad (10.2.3)$$

is the input data vector. For single-sensor applications, the input data vector is shift-invariant 

$$\mathbf{x}(n, \zeta) = [x(n, \zeta) \;\; x(n-1, \zeta) \;\; \cdots \;\; x(n-M+1, \zeta)]^T \qquad (10.2.4)$$

and the linear combiner takes the form of the FIR filter 

$$\hat{y}(n, \zeta) = \sum_{k=0}^{M-1} h(n, k)\, x(n-k, \zeta) \triangleq \mathbf{c}^H(n)\,\mathbf{x}(n, \zeta) \qquad (10.2.5)$$

where $c_k(n) = h^*(n, k)$ are the samples of the impulse response at time $n$. 

FIGURE 10.9 
Illustration of the difference of the input signal between (a) a multiple-input linear 
combiner and (b) a single-input FIR filter. 
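
The following Python sketch shows how the same linear combiner serves as an FIR filter once the data vector is built by shifting in one new sample at a time. It is restricted to real-valued data with fixed coefficients, and the function names are hypothetical.

```python
def combiner_output(c, x_vec):
    """Linear combiner for real-valued data: inner product of c and x_vec."""
    return sum(ck * xk for ck, xk in zip(c, x_vec))

def fir_data_vectors(x, M):
    """Yield the shift-invariant vectors [x(n), x(n-1), ..., x(n-M+1)]."""
    vec = [0.0] * M
    for sample in x:
        vec = [sample] + vec[:-1]    # shift in the newest sample, drop the oldest
        yield vec

x = [1.0, 2.0, 3.0, 4.0]
c = [0.5, 0.25, 0.125]
vectors = list(fir_data_vectors(x, M=3))
outputs = [combiner_output(c, v) for v in vectors]
print(vectors[2], outputs)   # -> [3.0, 2.0, 1.0] [0.5, 1.25, 2.125, 3.0]
```

Only one element of the data vector is new at each time instant, which is the shift invariance that the fast algorithms for the FIR case exploit; a general multiple-input combiner receives an entirely new vector at every step.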


Optimum filters. If we know the second-order moments of the SOE, we can design an 
optimum filter c,(n) by solving the normal equations 


R(n)eo(n) = d(n) (10.2.6) 
where R(n) = E{x(n, £)x"(n, £)} (10.2.7) 
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and d(n) = E{x(n, £)y*(n, £)} (10.2.8) 


are the correlation matrix of the input data vector and the cross-correlation between the 
input data vector and the desired response, respectively. During its normal operation, the 
optimum filter works with specific realizations of the SOE, that is, 


$$\hat{y}_o(n, \zeta) = \mathbf{c}_o^H(n)\, \mathbf{x}(n, \zeta) \tag{10.2.9}$$

$$\varepsilon_o(n, \zeta) = y(n, \zeta) - \hat{y}_o(n, \zeta) \tag{10.2.10}$$

where $\hat{y}_o(n, \zeta)$ is the optimum estimate and $\varepsilon_o(n, \zeta)$ is the optimum instantaneous error [see
Figure 10.10(a)]. However, the filter is optimized with respect to its average performance
across all possible realizations of the SOE, and the MMSE

$$P_o(n) \triangleq E\{|\varepsilon_o(n, \zeta)|^2\} = P_y(n) - \mathbf{d}^H(n)\, \mathbf{c}_o(n) \tag{10.2.11}$$


shows how well the filter performs on average. Also, we emphasize that the optimum 
coefficient vector is a nonrandom quantity and that the desired response is not essential for 
the operation of the optimum filter [see Equation (10.2.9)]. 
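
As a small numerical illustration of (10.2.6) and (10.2.11), the following sketch solves the normal equations for a real-valued filter with M = 2 and evaluates the MMSE. The moments `R`, `d`, and `P_y` are hypothetical numbers chosen for illustration, and `solve_2x2` is a helper introduced here, not something from the text.

```python
# Hypothetical second-order moments for a real-valued M = 2 problem
# (illustrative numbers, not taken from the text).
R = [[1.0, 0.5],
     [0.5, 1.0]]   # correlation matrix of the input data vector
d = [0.5, 0.2]     # cross-correlation between input vector and desired response
P_y = 1.0          # power of the desired response


def solve_2x2(A, b):
    """Solve the real 2x2 linear system A c = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0.0:
        raise ValueError("matrix is singular")
    c1 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    c2 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return [c1, c2]


c_o = solve_2x2(R, d)                         # normal equations (10.2.6)
P_o = P_y - (d[0] * c_o[0] + d[1] * c_o[1])   # MMSE (10.2.11), real case

print("c_o =", c_o)
print("P_o =", P_o)
```

Note that the desired response enters only through the moments `d` and `P_y`; once `c_o` is available, filtering a realization requires the input alone.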


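FIGURE 10.10
Illustration of the difference in operation between (a) optimum filters and
(b) adaptive filters. [Block diagrams: in (a) the coefficients are obtained by solving
$\mathbf{R}(n)\mathbf{c}_o(n) = \mathbf{d}(n)$; in (b) an adaptive algorithm uses the input signal, the desired
response, and the error to adjust the coefficients.]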


If the SOE is stationary, the optimum filter is computed once and is used with all
realizations $\{\mathbf{x}(n, \zeta), y(n, \zeta)\}$. For nonstationary environments, the optimum filter design
is repeated at every time instant n because the optimum filter is time-varying.


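Adaptive filters. In most practical applications, where the second-order moments $\mathbf{R}(n)$
and $\mathbf{d}(n)$ are unknown, the use of an adaptive filter is the best solution. If the SOE is ergodic,
we have

$$\mathbf{R} = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{n=-N}^{N} \mathbf{x}(n, \zeta)\, \mathbf{x}^H(n, \zeta) \tag{10.2.12}$$

$$\mathbf{d} = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{n=-N}^{N} \mathbf{x}(n, \zeta)\, y^*(n, \zeta) \tag{10.2.13}$$

because ensemble averages are equal to time averages (see Section 3.3). If we collect a
sufficient amount of data $\{\mathbf{x}(n, \zeta), y(n, \zeta)\}_0^{N-1}$, we can obtain an acceptable estimate of
the optimum filter by computing the estimates

$$\hat{\mathbf{R}}_N(\zeta) = \frac{1}{N} \sum_{n=0}^{N-1} \mathbf{x}(n, \zeta)\, \mathbf{x}^H(n, \zeta) \tag{10.2.14}$$

$$\hat{\mathbf{d}}_N(\zeta) = \frac{1}{N} \sum_{n=0}^{N-1} \mathbf{x}(n, \zeta)\, y^*(n, \zeta) \tag{10.2.15}$$

by time-averaging and then solving the linear system

$$\hat{\mathbf{R}}_N(\zeta)\, \mathbf{c}_N(\zeta) = \hat{\mathbf{d}}_N(\zeta) \tag{10.2.16}$$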


The obtained coefficients can be used to filter the data in the interval $0 \le n \le N-1$
or to start filtering the data for $n \ge N$, on a sample-by-sample basis, in real time. This
procedure, which we called block adaptive filtering in Chapter 8, should be repeated each
time the properties of the SOE change significantly. Clearly, block adaptive filters cannot
track statistical variations within the operating block and cannot be used in all applications.
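
The block procedure in (10.2.14) through (10.2.16) can be sketched as follows for a real-valued two-tap FIR filter. The data model (white Gaussian input, a "true" system with coefficients 0.8 and -0.3, additive noise of standard deviation 0.1), the block length, and all function names are assumptions made for this illustration only.

```python
import random


def solve_2x2(A, b):
    """Solve the real 2x2 linear system A c = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    c1 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    c2 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return [c1, c2]


def block_adaptive_filter(x_vectors, y):
    """Time-average estimates (10.2.14)-(10.2.15), then solve (10.2.16)."""
    N = len(y)
    R_hat = [[0.0, 0.0], [0.0, 0.0]]
    d_hat = [0.0, 0.0]
    for xv, yn in zip(x_vectors, y):
        for i in range(2):
            d_hat[i] += xv[i] * yn / N
            for j in range(2):
                R_hat[i][j] += xv[i] * xv[j] / N
    return solve_2x2(R_hat, d_hat)


# Hypothetical data: one realization of the input and the desired response.
rng = random.Random(0)
N = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(N + 1)]
x_vectors = [[x[n], x[n - 1]] for n in range(1, N + 1)]   # shift-invariant input vectors
y = [0.8 * xv[0] - 0.3 * xv[1] + 0.1 * rng.gauss(0.0, 1.0) for xv in x_vectors]

c_N = block_adaptive_filter(x_vectors, y)
print("block estimate of the coefficients:", c_N)
```

All N samples must be collected before any coefficient is available, which is the limitation noted above for applications that need a result after every sample.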

Indeed, there are applications, for example, adaptive equalization, in which each input 
sample should be processed immediately after its observation and before the arrival of the 
next sample. In such cases, we should use a sample-by-sample adaptive filter that starts 
filtering immediately after the observation of the pair $\{\mathbf{x}(0), y(0)\}$ using a “guess” $\mathbf{c}(-1)$
for the adaptive filter coefficients. Usually, the initial guess $\mathbf{c}(-1)$ is a very poor estimate of
the optimum filter $\mathbf{c}_o$. However, this estimate improves with time as the filter processes
additional pairs of observations.

As we discussed in Section 10.2.1, an adaptive filter consists of three key modules: an 
adjustable filtering structure that uses input samples to compute the output, the criterion 
of performance that monitors the performance of the filter, and the adaptive algorithm that 
updates the filter coefficients. The key component of any adaptive filter is the adaptive 
algorithm, which is a rule to determine the filter coefficients from the available data $\mathbf{x}(n, \zeta)$
and $y(n, \zeta)$ [see Figure 10.10(b)]. The dependence of $\mathbf{c}(n, \zeta)$ on the input signal makes the
adaptive filter a nonlinear and time-varying stochastic system.

The data available to the adaptive filter at time n are the input data vector $\mathbf{x}(n, \zeta)$, the
desired response $y(n, \zeta)$, and the most recent update $\mathbf{c}(n-1, \zeta)$ of the coefficient vector.
The adaptive filter, at each time n, performs the following computations:

1. Filtering:

$$\hat{y}(n, \zeta) = \mathbf{c}^H(n-1, \zeta)\, \mathbf{x}(n, \zeta) \tag{10.2.17}$$

2. Error formation:

$$e(n, \zeta) = y(n, \zeta) - \hat{y}(n, \zeta) \tag{10.2.18}$$

3. Adaptive algorithm:

$$\mathbf{c}(n, \zeta) = \mathbf{c}(n-1, \zeta) + \Delta\mathbf{c}\{\mathbf{x}(n, \zeta), e(n, \zeta)\} \tag{10.2.19}$$

where the increment or correction term $\Delta\mathbf{c}(n, \zeta)$ is chosen to bring $\mathbf{c}(n, \zeta)$ close to $\mathbf{c}_o$
with the passage of time. If we can successively determine the corrections $\Delta\mathbf{c}(n, \zeta)$ so that
$\mathbf{c}(n, \zeta) \approx \mathbf{c}_o$, that is, $\|\mathbf{c}(n, \zeta) - \mathbf{c}_o\| < \delta$ for some $n > N_\delta$, we obtain a good approximation
for $\mathbf{c}_o$ while avoiding the explicit averagings (10.2.14) and (10.2.15) and the solution of the normal
equations (10.2.16). A key requirement is that $\Delta\mathbf{c}(n, \zeta)$ must vanish if the error $e(n, \zeta)$
vanishes. Hence, $e(n, \zeta)$ plays a major role in determining the increment $\Delta\mathbf{c}(n, \zeta)$.

We notice that the estimate $\hat{y}(n, \zeta)$ of the desired response $y(n, \zeta)$ is evaluated using the
current input vector $\mathbf{x}(n, \zeta)$ and the past filter coefficients $\mathbf{c}(n-1, \zeta)$. The estimate $\hat{y}(n, \zeta)$
and the corresponding error $e(n, \zeta)$ can be considered as predicted estimates compared to
the actual estimates that would be evaluated using the current coefficient vector $\mathbf{c}(n, \zeta)$.
Coefficient updating methods that use the predicted error $e(n, \zeta)$ are known as a priori type
adaptive algorithms.

If we use the actual estimates, obtained using the current estimate $\mathbf{c}(n, \zeta)$ of the adaptive
filter coefficients, we have

1. Filtering:

$$\hat{y}_a(n, \zeta) = \mathbf{c}^H(n, \zeta)\, \mathbf{x}(n, \zeta) \tag{10.2.20}$$

2. Error formation:

$$\varepsilon(n, \zeta) = y(n, \zeta) - \hat{y}_a(n, \zeta) \tag{10.2.21}$$

3. Adaptive algorithm:

$$\mathbf{c}(n, \zeta) = \mathbf{c}(n-1, \zeta) + \Delta\mathbf{c}\{\mathbf{x}(n, \zeta), \varepsilon(n, \zeta)\} \tag{10.2.22}$$

which are known as a posteriori type adaptive algorithms. The terms a priori and a posteriori
were introduced in Carayannis et al. (1983) to emphasize the use of estimates evaluated
before or after the updating of the filter coefficients. The difference between a priori and
a posteriori errors and their meanings will be further clarified when we discuss adaptive
least-squares filters in Section 10.5. The timing diagram for the above two algorithms is
shown in Figure 10.11.
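
The three a priori steps (10.2.17) through (10.2.19) can be sketched as a sample-by-sample loop. The text leaves the correction term unspecified at this point, so the sketch assumes an LMS-type correction $2\mu\,\mathbf{x}(n)e(n)$ for real-valued signals, which vanishes when the error vanishes. The step size `mu`, the data model, and the function name are illustrative assumptions. The loop also records the a posteriori error obtained by refiltering the same input with the updated coefficients.

```python
import random


def adaptive_filter(x_vectors, y, mu):
    """Sample-by-sample a priori adaptive filter with an assumed LMS-type correction."""
    c = [0.0, 0.0]                 # initial guess c(-1)
    history = []
    for xv, yn in zip(x_vectors, y):
        y_hat = c[0] * xv[0] + c[1] * xv[1]        # 1. filtering with c(n-1)
        e = yn - y_hat                             # 2. a priori error e(n)
        c = [c[i] + 2.0 * mu * xv[i] * e for i in range(2)]   # 3. coefficient update
        eps = yn - (c[0] * xv[0] + c[1] * xv[1])   # a posteriori error, uses c(n)
        history.append((e, eps))
    return c, history


# Hypothetical data: desired response from a two-tap system plus noise.
rng = random.Random(0)
N = 5000
mu = 0.01
x = [rng.gauss(0.0, 1.0) for _ in range(N + 1)]
x_vectors = [[x[n], x[n - 1]] for n in range(1, N + 1)]
y = [0.8 * xv[0] - 0.3 * xv[1] + 0.1 * rng.gauss(0.0, 1.0) for xv in x_vectors]

c_final, history = adaptive_filter(x_vectors, y, mu)
print("coefficients after the last update:", c_final)
```

For this particular correction the two errors are related by $\varepsilon(n) = e(n)\,[1 - 2\mu\|\mathbf{x}(n)\|^2]$, so the a posteriori error is smaller in magnitude whenever $0 < \mu\|\mathbf{x}(n)\|^2 < 1$.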


FIGURE 10.11
Timing diagrams for a priori and a posteriori adaptive algorithms. [Time axis from nT to
(n+1)T: c(n-1), x(n), and y(n) are followed by e(n), c(n), and ε(n); x(n+1) and y(n+1)
arrive at (n+1)T.]


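In conclusion, the objective of an adaptive filter is to use the available data at time n,
namely, $\{\mathbf{x}(n, \zeta), y(n, \zeta), \mathbf{c}(n-1, \zeta)\}$, to update the “old” coefficient vector $\mathbf{c}(n-1, \zeta)$
to a “new” estimate $\mathbf{c}(n, \zeta)$ so that $\mathbf{c}(n, \zeta)$ is closer to the optimum filter vector $\mathbf{c}_o(n)$ and
the output $\hat{y}(n)$ is a better estimate of the desired response $y(n)$. Most adaptive algorithms
have the following form:

$$\begin{pmatrix} \text{new} \\ \text{coefficient} \\ \text{vector} \end{pmatrix} = \begin{pmatrix} \text{old} \\ \text{coefficient} \\ \text{vector} \end{pmatrix} + \begin{pmatrix} \text{adaptation} \\ \text{gain} \\ \text{vector} \end{pmatrix} \cdot \begin{pmatrix} \text{error} \\ \text{signal} \end{pmatrix} \tag{10.2.23}$$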

where the error signal is the difference between the desired response and the predicted 
or actual outputs of the adaptive filter. One of the fundamental differences among the 
various algorithms is the optimality of the used adaptation gain vector and the amount of 
computation required for its evaluation. 


10.2.3 Stability and Steady-State Performance of Adaptive Filters 


We now address the issues of stability and performance of adaptive filters. Since the goal
of an adaptive filter $\mathbf{c}(n, \zeta)$ is first to find and then track the optimum filter $\mathbf{c}_o(n)$ as quickly
and accurately as possible, we can evaluate its performance by measuring some function
of its deviation

$$\tilde{\mathbf{c}}(n, \zeta) \triangleq \mathbf{c}(n, \zeta) - \mathbf{c}_o(n) \tag{10.2.24}$$


from the corresponding optimum filter. Clearly, an acceptable adaptive filter should be stable 
in the bounded-input bounded-output (BIBO) sense, and its performance should be close to 
that of the associated optimum filter. The analysis of BIBO stability is extremely difficult 
because adaptive filters are nonlinear, time-varying systems working in a random SOE. The 
performance of adaptive filters is primarily measured by investigating the value of the MSE 
as a function of time. To discuss these problems, first we consider an adaptive filter working 
in a stationary SOE, and then we extend our discussion to a nonstationary SOE. 


Stability 


The adaptive filter starts its operation at time, say, n = 0, and by processing the
observations $\{\mathbf{x}(n, \zeta), y(n, \zeta)\}_0^{\infty}$ generates a sequence of vectors $\{\mathbf{c}(n, \zeta)\}_0^{\infty}$ using the adaptive
algorithm. Since the FIR filtering structure is always stable, the output or the error of the
adaptive filter will be bounded if its coefficients are always kept close to the coefficients of
the associated optimum filter. However, the presence of the feedback loop in every adaptive
filter (see Figure 10.10) raises the issue of stability. In a stationary SOE, where the
optimum filter $\mathbf{c}_o$ is constant, convergence of $\mathbf{c}(n, \zeta)$ to $\mathbf{c}_o$ as $n \to \infty$ will guarantee the BIBO
stability of the adaptive filter. For a specific realization $\zeta$, the kth component $c_k(n, \zeta)$ or
the norm $\|\mathbf{c}(n, \zeta)\|$ of the vector $\mathbf{c}(n, \zeta)$ is a sequence of numbers that might or might not
converge.² Since the coefficients $c_k(n, \zeta)$ are random, we must use the concept of stochastic
convergence (Papoulis 1991).

We say that a random sequence converges everywhere if the sequence $c_k(n, \zeta)$
converges for every $\zeta$, that is,

$$\lim_{n \to \infty} c_k(n, \zeta) = c_{o,k}(\zeta) \tag{10.2.25}$$

where the limit $c_{o,k}(\zeta)$ depends, in general, on $\zeta$. Requiring the adaptive filter to converge
to $\mathbf{c}_o$ for every possible realization of the SOE is both hard to guarantee and not necessary,
because some realizations may have very small or zero probability of occurrence.

If we wish to ensure that the adaptive filter converges for the realizations of the SOE 
that may actually occur, we can use the concept of convergence almost everywhere. We say 
that the random sequence $c_k(n, \zeta)$ converges almost everywhere or with probability 1 if

$$P\left\{ \lim_{n \to \infty} |c_k(n, \zeta) - c_{o,k}(\zeta)| = 0 \right\} = 1 \tag{10.2.26}$$

which implies that there can be some sample sequences that do not converge, but these must
occur with probability zero.
Another type of stochastic convergence that is used in adaptive filtering is defined by

$$\lim_{n \to \infty} E\{|c_k(n, \zeta) - c_{o,k}|^2\} = \lim_{n \to \infty} E\{|\tilde{c}_k(n, \zeta)|^2\} = 0 \tag{10.2.27}$$

and is known as convergence in the MS sense. The primary reason for the use of mean
square (MS) convergence is that, unlike almost-everywhere convergence, it uses only one
sequence of numbers that takes into account the averaging effect of all sample sequences.
Furthermore, it uses second-order moments for verification and has an interpretation in
terms of power. Convergence in MS neither implies nor is implied by convergence with
probability 1. Since

$$E\{|\tilde{c}_k(n, \zeta)|^2\} = |E\{\tilde{c}_k(n, \zeta)\}|^2 + \mathrm{var}\{\tilde{c}_k(n, \zeta)\} \tag{10.2.28}$$

if we can show that $E\{\tilde{c}_k(n, \zeta)\} \to 0$ as $n \to \infty$ and $\mathrm{var}\{\tilde{c}_k(n, \zeta)\}$ is bounded for all n, the
mean square value of the coefficient error remains bounded; it converges to zero, which is
convergence in MS, only if the variance also tends to zero. When the mean converges and
the variance is bounded, we can say that an adaptive filter that operates
in a stationary SOE is an asymptotically stable filter.
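
The decomposition of the mean square value into a squared mean and a variance can be checked numerically. The sketch below builds a hypothetical ensemble of real coefficient errors at one time instant (a bias of 0.3 plus zero-mean fluctuations of standard deviation 0.2, both arbitrary choices) and compares the two sides of the identity using sample moments.

```python
import random


def ms_decomposition(samples):
    """Return (mean square, squared mean, variance) of real-valued samples."""
    n = len(samples)
    mean = sum(samples) / n
    mean_square = sum(s * s for s in samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n   # population variance
    return mean_square, mean ** 2, variance


rng = random.Random(1)
# Hypothetical ensemble of coefficient errors across realizations at a fixed time n.
ensemble = [0.3 + 0.2 * rng.gauss(0.0, 1.0) for _ in range(10000)]

ms, mean_sq, var = ms_decomposition(ensemble)
print("mean square          :", ms)
print("squared mean + var   :", mean_sq + var)
```

The mean square value is roughly 0.09 + 0.04 here. Driving the bias to zero removes only the first term, so the variance sets the floor below which the mean square error cannot fall.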


Performance measures 


In theoretical investigations, any quantity that measures the deviation of an adaptive 
filter from the corresponding optimum filter can be used to evaluate its performance. 
The mean square deviation (MSD)

$$D(n) \triangleq E\{\|\mathbf{c}(n, \zeta) - \mathbf{c}_o(n)\|^2\} = E\{\|\tilde{\mathbf{c}}(n, \zeta)\|^2\} \tag{10.2.29}$$

measures the average squared distance between the coefficient vectors of the adaptive and optimum
filters. Although the MSD is not measurable in practice, it is useful in analytical studies.
Adaptive algorithms that minimize D(n) for each value of n are known as algorithms with
optimum learning.

² We recall that a sequence of real nonrandom numbers $a_0, a_1, a_2, \ldots$ converges to a number a if and only if for
every positive number $\delta$ there exists a positive integer $N_\delta$ such that for all $n > N_\delta$, we have $|a_n - a| < \delta$. This is
abbreviated by $\lim_{n \to \infty} a_n = a$.

In Section 6.2.2 we showed that if the input correlation matrix is positive definite, any
deviation, say, $\tilde{\mathbf{c}}(n)$, of the filter coefficients from their optimum setting increases
the mean square error (MSE) by an amount equal to $\tilde{\mathbf{c}}^H(n)\mathbf{R}\tilde{\mathbf{c}}(n)$, known as excess MSE
(EMSE). In adaptive filters, the random deviation $\tilde{\mathbf{c}}(n, \zeta)$ from the optimum results in an
EMSE, which is measured by the ensemble average of $\tilde{\mathbf{c}}^H(n, \zeta)\mathbf{R}\tilde{\mathbf{c}}(n, \zeta)$. For a posteriori
adaptive filters, the MSE can be decomposed as

$$P'(n) \triangleq E\{|\varepsilon(n, \zeta)|^2\} = P'_o(n) + P'_{ex}(n) \tag{10.2.30}$$

where $P'_{ex}(n)$ is the EMSE and $P'_o(n)$ is the MMSE given by

$$P'_o(n) \triangleq E\{|\varepsilon_o(n, \zeta)|^2\} \tag{10.2.31}$$

with

$$\varepsilon_o(n, \zeta) \triangleq y(n, \zeta) - \mathbf{c}_o^H(n)\, \mathbf{x}(n, \zeta) \tag{10.2.32}$$

as the a posteriori optimum filtering error. Clearly, the a posteriori EMSE $P'_{ex}(n)$ is given
by

$$P'_{ex}(n) \triangleq P'(n) - P'_o(n) \tag{10.2.33}$$

For a priori adaptive algorithms, where we use the “old” coefficient vector $\mathbf{c}(n-1, \zeta)$,
it is more appropriate to use the a priori EMSE given by

$$P_{ex}(n) \triangleq P(n) - P_o(n) \tag{10.2.34}$$

where

$$P(n) \triangleq E\{|e(n, \zeta)|^2\} \tag{10.2.35}$$

and

$$P_o(n) \triangleq E\{|e_o(n, \zeta)|^2\} \tag{10.2.36}$$

with

$$e_o(n, \zeta) \triangleq y(n, \zeta) - \mathbf{c}_o^H(n-1)\, \mathbf{x}(n, \zeta) \tag{10.2.37}$$

as the a priori optimum filtering error. If the SOE is stationary, we have $\varepsilon_o(n, \zeta) = e_o(n, \zeta)$;
that is, the optimum a priori and a posteriori errors are identical.
The dimensionless ratio

$$\mathcal{M}(n) \triangleq \frac{P_{ex}(n)}{P_o(n)} \quad \text{or} \quad \mathcal{M}(n) \triangleq \frac{P'_{ex}(n)}{P'_o(n)} \tag{10.2.38}$$

known as misadjustment, is a useful measure of the quality of adaptation. Since the EMSE
is always nonnegative, no adaptive filter can perform (on the average) better than
the corresponding optimum filter. In this sense, we can say that the excess MSE or the
misadjustment measures the cost of adaptation.
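
To make the excess MSE and the misadjustment concrete, the following sketch evaluates the quadratic form $\tilde{\mathbf{c}}^T\mathbf{R}\tilde{\mathbf{c}}$ for a single fixed, real-valued deviation and divides it by the MMSE. The numbers in `R`, `c_tilde`, and `P_o` are hypothetical; in an adaptive filter the deviation is random, and the EMSE is the ensemble average of this quantity.

```python
def quadratic_form(R, v):
    """Return v^T R v for a real matrix R and a real vector v."""
    n = len(v)
    return sum(v[i] * R[i][j] * v[j] for i in range(n) for j in range(n))


# Hypothetical values for illustration.
R = [[1.0, 0.5],
     [0.5, 1.0]]        # positive definite input correlation matrix
c_tilde = [0.1, -0.2]   # deviation of the coefficients from the optimum
P_o = 0.5               # MMSE of the optimum filter

P_ex = quadratic_form(R, c_tilde)   # excess MSE caused by the deviation
misadjustment = P_ex / P_o          # dimensionless ratio as in (10.2.38)

print("excess MSE    :", P_ex)
print("misadjustment :", misadjustment)
```

Because `R` is positive definite, the excess MSE is positive for any nonzero deviation, which is why the ratio can be read as a cost.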


Acquisition and tracking 


Plots of the MSD, MSE, or M(n) as a function of n, which are known as learning 
curves, characterize the performance of an adaptive filter and are widely used in theoretical 
and experimental studies. When the adaptive filter starts its operation, its coefficients provide 
a poor estimate of the optimum filter and the MSD or the MSE is very large. As the number 
of observations processed by the adaptive filter increases with time, we expect the quality 
of the estimate $\mathbf{c}(n, \zeta)$ to improve, and therefore the MSD and the MSE to decrease. The
property of an adaptive filter to bring the coefficient vector $\mathbf{c}(n, \zeta)$ close to the optimum
filter $\mathbf{c}_o$, independently of the initial condition $\mathbf{c}(-1)$ and the statistical properties of the
SOE, is called acquisition. During the acquisition phase, we say that the adaptive filter is
in a transient mode of operation. 

A natural requirement for any adaptive algorithm is that adaptation stops after the
algorithm has found the optimum filter $\mathbf{c}_o$. However, owing to the randomness of the SOE
and the finite amount of data used by the adaptive filter, its coefficients continuously fluctuate
about their optimum settings, that is, about the coefficients of the optimum filter, in a random
manner. As a result, the adaptive filter reaches a steady-state mode of operation, after a
certain time, and its performance stops improving.

The transient and steady-state modes of operation in a stationary SOE are illustrated in 
Figure 10.12(a). The duration of the acquisition phase characterizes the speed of adaptation 
or rate of convergence of the adaptive filter, whereas the steady-state EMSE or misadjust- 
ment characterizes the quality of adaptation. These properties depend on the SOE, the 
filtering structure, and the adaptive algorithm. 


FIGURE 10.12
Modes of operation in a stationary and nonstationary SOE. [Sketches of c(n) versus n for
(a) a stationary SOE and (b) a nonstationary SOE, each showing a transient phase followed by
a steady-state phase, with tracking labeled in the plots.]


At each time n, any adaptive filter computes an estimate of the optimum filter using 
a finite amount of data. The error resulting from the finite amount of data is known as 
estimation error. An additional error, known as the lag error, results when the adaptive
filter attempts to track a time-varying optimum filter $\mathbf{c}_o(n)$ in a nonstationary SOE. The
modes of operation of an adaptive filter in a nonstationary SOE are illustrated in Figure
10.12(b). The SOE of the adaptive filter becomes nonstationary if $\mathbf{x}(n, \zeta)$ or $y(n, \zeta)$ or
both are nonstationary. The nonstationarity of the input is more severe than that of the
desired response because it may affect the invertibility of $\mathbf{R}(n)$. Since the adaptive filter
has to first acquire and then track the optimum filter, tracking is a steady-state property. 
Therefore, in general, the speed of adaptation (a transient-phase property) and the tracking 
capability (a steady-state property) are two different characteristics of the adaptive filter. 
Clearly, tracking is feasible only if the statistics of the SOE change “slowly” compared to 
the speed of tracking of the adaptive filter. These concepts will become more precise in 
Section 10.8, where we discuss the tracking properties of adaptive filters. 


10.2.4 Some Practical Considerations 


The complexity of the hardware or software implementation of an adaptive filter is basi- 
cally determined by the following factors: (1) the number of instructions per time update 
or computing time required to complete one time updating; (2) the number of memory 
locations required to store the data and the program instructions; (3) the structure of infor- 
mation flow in the algorithm, which is very important for implementations using parallel 
processing, systolic arrays, or VLSI chips; and (4) the investment in hardware design tools 
and software development. We focus on implementations for general-purpose computers 
or special-purpose digital signal processors that basically involve programming in a
high-level or assembly language. More details about DSP software development can be found in
Embree and Kimble (1991) and in Lapsley et al. (1997).

The digital implementation of adaptive filters implies the use of finite-word-length
arithmetic. As a result, the performance of the practical (finite-precision) adaptive filters 
deviates from the performance of ideal (infinite-precision) adaptive filters. Finite-precision 
implementation affects the performance of adaptive filters in several complicated ways. The 
major factors are (1) the quantization of the input signal(s) and the desired response, (2) 
the quantization of filter coefficients, and (3) the roundoff error in the arithmetic operations 
used to implement the adaptive filter. The nonlinear nature of adaptive filters coupled with 
the nonlinearities introduced by the finite-word-length arithmetic makes the performance 
evaluation of practical adaptive filters extremely difficult. Although theoretical analysis 
provides insight and helps to clarify the behavior of adaptive filters, the most effective way 
is to simulate the filter and measure its performance. 

Finite precision affects two important properties of adaptive filters, which, although 
related, are not equivalent. Let us denote by $\mathbf{c}_{ip}(n)$ and $\mathbf{c}_{fp}(n)$ the coefficient vectors of the
filter implemented using infinite- and finite-precision arithmetic, respectively. An adaptive
filter is said to be numerically stable if the difference vector $\mathbf{c}_{ip}(n) - \mathbf{c}_{fp}(n)$ always remains
bounded, that is, the roundoff error propagation system is stable. Numerical stability is
an inherent property of the adaptive algorithm and cannot be altered by increasing the
numerical precision. Indeed, increasing the word length or reorganizing the computations
will simply delay the divergence of an adaptive filter; only an actual change of the algorithm
can stabilize an adaptive filter by improving the properties of the roundoff error propagation 
system (Ljung and Ljung 1985; Cioffi 1987). 

The numerical accuracy of an adaptive filter measures the deviation, at steady state, of 
any obtained estimates from theoretically expected values, due to roundoff errors. Numerical
inaccuracy results in an increase of the output error without catastrophic problems and can
be reduced by increasing the word length. In contrast, lack of numerical stability leads
to catastrophic overflow (divergence or blowup of the algorithm) as a result of roundoff 
error accumulation. Numerically unstable algorithms converging before “explosion” may 
provide good numerical accuracy. Therefore, although the two properties are related, one 
does not imply the other. 

Two other important issues are the sensitivity of an algorithm to bad or abnormal input 
data (e.g., poorly exciting input) and its sensitivity to initialization. All these issues are very 
important for the application of adaptive algorithms to real-world problems and are further 
discussed in the context of specific algorithms. 


10.3 METHOD OF STEEPEST DESCENT 


Most adaptive filtering algorithms are obtained by simple modifications of iterative methods 
for solving deterministic optimization problems. Studying these techniques helps one to 
understand several aspects of the operation of adaptive filters. In this section we discuss 
gradient-based optimization methods because they provide the ground for the development 
of the most widely used adaptive filtering algorithms. 

As we discussed in Section 6.2.1, the error performance surface of an optimum filter,
in a stationary SOE, is given by

$$P(\mathbf{c}) = P_y - \mathbf{c}^H\mathbf{d} - \mathbf{d}^H\mathbf{c} + \mathbf{c}^H\mathbf{R}\mathbf{c} \tag{10.3.1}$$

where $P_y = E\{|y(n)|^2\}$. Equation (10.3.1) is a quadratic function of the coefficients that
represents a bowl-shaped surface (when $\mathbf{R}$ is positive definite) with a unique minimum
at $\mathbf{c}_o$ (the optimum filter). There are two distinct ways to find the minimum of (10.3.1):

1. Solve the normal equations $\mathbf{R}\mathbf{c} = \mathbf{d}$, using a direct linear system solution method.
2. Find the minimum of $P(\mathbf{c})$, using an iterative minimization algorithm.


Although direct methods provide the solution in a finite number of steps, sometimes we 
prefer iterative methods because they require less numerical precision, are computationally 
less expensive, work when R is not invertible, and are the only choice for nonquadratic 
performance functions. 

In all iterative methods, we start with an approximate solution (a guess), which we 
keep changing until we reach the minimum. Thus, to find the optimum $\mathbf{c}_o$, we start at some
arbitrary point $\mathbf{c}_0$, usually the null vector $\mathbf{c}_0 = \mathbf{0}$, and then start a search for the “bottom
of the bowl.” The key is to choose the steps in a systematic way so that each step takes us 
to a lower point until finally we reach the bottom. What differentiates various optimization 
algorithms is how we choose the direction and the size of each step. 


Steepest-descent algorithm (SDA) 


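If the function $P(\mathbf{c})$ has continuous derivatives, it is possible to approximate its value
at an arbitrary neighboring point $\mathbf{c} + \Delta\mathbf{c}$ by using the Taylor expansion

$$P(\mathbf{c} + \Delta\mathbf{c}) = P(\mathbf{c}) + \sum_{i=1}^{M} \frac{\partial P(\mathbf{c})}{\partial c_i}\, \Delta c_i + \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} \Delta c_i\, \frac{\partial^2 P(\mathbf{c})}{\partial c_i\, \partial c_j}\, \Delta c_j + \cdots \tag{10.3.2}$$

or more compactly

$$P(\mathbf{c} + \Delta\mathbf{c}) = P(\mathbf{c}) + (\Delta\mathbf{c})^T \nabla P(\mathbf{c}) + \tfrac{1}{2} (\Delta\mathbf{c})^T [\nabla^2 P(\mathbf{c})] (\Delta\mathbf{c}) + \cdots \tag{10.3.3}$$

where $\nabla P(\mathbf{c})$ is the gradient vector, with elements $\partial P(\mathbf{c})/\partial c_i$, and $\nabla^2 P(\mathbf{c})$ is the Hessian
matrix, with elements $\partial^2 P(\mathbf{c})/(\partial c_i\, \partial c_j)$. For simplicity we consider filters with real
coefficients, but the conclusions apply when the coefficients are complex. For the quadratic
function (10.3.1), we have

$$\nabla P(\mathbf{c}) = 2(\mathbf{R}\mathbf{c} - \mathbf{d}) \tag{10.3.4}$$

$$\nabla^2 P(\mathbf{c}) = 2\mathbf{R} \tag{10.3.5}$$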

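and the higher-order terms are zero. For nonquadratic functions, higher-order terms are
nonzero, but if $\|\Delta\mathbf{c}\|$ is small, we can use a quadratic approximation. We note that if
$\nabla P(\mathbf{c}_o) = \mathbf{0}$ and $\mathbf{R}$ is positive definite, then $\mathbf{c}_o$ is the minimum because
$(\Delta\mathbf{c})^T [\nabla^2 P(\mathbf{c}_o)] (\Delta\mathbf{c}) > 0$ for any nonzero $\Delta\mathbf{c}$. Hence, if we choose a sufficiently small
step $\Delta\mathbf{c}$ so that $(\Delta\mathbf{c})^T \nabla P(\mathbf{c}) < 0$, we will have $P(\mathbf{c} + \Delta\mathbf{c}) < P(\mathbf{c})$, that is, we make a step
to a point closer to the minimum. Since $(\Delta\mathbf{c})^T \nabla P(\mathbf{c}) = \|\Delta\mathbf{c}\|\, \|\nabla P(\mathbf{c})\| \cos\theta$, where $\theta$ is the
angle between $\Delta\mathbf{c}$ and $\nabla P(\mathbf{c})$, the reduction in MSE for a given step length is maximum when
$\cos\theta = -1$, that is, when $\Delta\mathbf{c}$ points in the direction of $-\nabla P(\mathbf{c})$. For this reason, the
direction of the negative gradient is known as the direction
of steepest descent. This leads to the following iterative minimization algorithm

$$\mathbf{c}_k = \mathbf{c}_{k-1} + \mu[-\nabla P(\mathbf{c}_{k-1})] \qquad k > 0 \tag{10.3.6}$$

which is known as the method of steepest descent (Scales 1985). The positive constant $\mu$,
known as the step-size parameter, controls the size of the descent in the direction of the
negative gradient. The algorithm is usually initialized with $\mathbf{c}_0 = \mathbf{0}$. The steepest-descent
algorithm (SDA) is illustrated in Figure 10.13 for a single-parameter case.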
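
A minimal sketch of the iteration (10.3.6) for the quadratic surface, using the gradient (10.3.4) and real coefficients, is given below. The moments `R` and `d`, the step size, and the number of iterations are hypothetical choices made for illustration.

```python
def gradient(R, d, c):
    """Gradient 2(Rc - d) of the quadratic error surface, real case."""
    M = len(d)
    return [2.0 * (sum(R[i][j] * c[j] for j in range(M)) - d[i]) for i in range(M)]


def steepest_descent(R, d, mu, num_iter):
    """Iterate c_k = c_{k-1} - mu * grad P(c_{k-1}), starting from c_0 = 0."""
    c = [0.0] * len(d)
    trajectory = [c[:]]
    for _ in range(num_iter):
        g = gradient(R, d, c)
        c = [c[i] - mu * g[i] for i in range(len(d))]
        trajectory.append(c[:])
    return trajectory


# Hypothetical second-order moments; the solution of Rc = d is [8/15, -1/15].
R = [[1.0, 0.5],
     [0.5, 1.0]]
d = [0.5, 0.2]

trajectory = steepest_descent(R, d, mu=0.1, num_iter=200)
c_final = trajectory[-1]
print("c after 200 iterations:", c_final)
```

The iteration needs only matrix-vector products, in contrast to a direct solution of the normal equations, and the stored `trajectory` is what one would plot over the contours of the error surface.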

For the cost function in (10.3.1), the SDA becomes

$$\mathbf{c}_k = \mathbf{c}_{k-1} + 2\mu(\mathbf{d} - \mathbf{R}\mathbf{c}_{k-1}) = (\mathbf{I} - 2\mu\mathbf{R})\mathbf{c}_{k-1} + 2\mu\mathbf{d} \tag{10.3.7}$$

which is a recursive difference equation. Note that k denotes an iteration in the SDA and
has nothing to do with time. However, this iterative optimization can be combined with
filtering to obtain a type of “asymptotically” optimum filter defined by

$$e(n, \zeta) = y(n, \zeta) - \mathbf{c}_{n-1}^H\, \mathbf{x}(n, \zeta) \tag{10.3.8}$$

$$\mathbf{c}_n = \mathbf{c}_{n-1} + 2\mu(\mathbf{d} - \mathbf{R}\mathbf{c}_{n-1}) \tag{10.3.9}$$

and is further discussed in Problem 10.2.

There are two key performance factors in the design of iterative optimization
algorithms: stability and rate of convergence.


Stability


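An algorithm is said to be stable if it converges to the minimum regardless of the starting
point. To investigate the stability of SDA, we rewrite (10.3.7) in terms of the coefficient
error vector

$$\tilde{\mathbf{c}}_k \triangleq \mathbf{c}_k - \mathbf{c}_o \qquad k \ge 0 \tag{10.3.10}$$

as

$$\tilde{\mathbf{c}}_k = (\mathbf{I} - 2\mu\mathbf{R})\tilde{\mathbf{c}}_{k-1} \qquad k > 0 \tag{10.3.11}$$

which is a homogeneous difference equation. Using the principal-components
transformation $\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$ (see Section 3.5), we can write (10.3.11) as

$$\tilde{\mathbf{c}}'_k = (\mathbf{I} - 2\mu\boldsymbol{\Lambda})\tilde{\mathbf{c}}'_{k-1} \qquad k > 0 \tag{10.3.12}$$

where

$$\tilde{\mathbf{c}}'_k \triangleq \mathbf{Q}^T\tilde{\mathbf{c}}_k \qquad k \ge 0 \tag{10.3.13}$$

is the transformed coefficient error vector. Since $\boldsymbol{\Lambda}$ is diagonal, (10.3.12) consists of a set
of M decoupled first-order difference equations

$$\tilde{c}'_{k,i} = (1 - 2\mu\lambda_i)\, \tilde{c}'_{k-1,i} \qquad i = 1, 2, \ldots, M, \; k > 0 \tag{10.3.14}$$

with each describing a natural mode of the SDA. The solutions of (10.3.12) are given by

$$\tilde{c}'_{k,i} = (1 - 2\mu\lambda_i)^k\, \tilde{c}'_{0,i} \qquad k \ge 0 \tag{10.3.15}$$

If for all $1 \le i \le M$

$$-1 < 1 - 2\mu\lambda_i < 1 \tag{10.3.16}$$

or equivalently

$$0 < \mu < \frac{1}{\lambda_i} \tag{10.3.17}$$

then $\tilde{c}'_{k,i}$, $1 \le i \le M$, tends to zero as $k \to \infty$. This implies that $\mathbf{c}_k$ converges exponentially
to $\mathbf{c}_o$ as $k \to \infty$ because $\|\tilde{\mathbf{c}}'_k\| = \|\mathbf{Q}^T\tilde{\mathbf{c}}_k\| = \|\tilde{\mathbf{c}}_k\|$. If $\mathbf{R}$ is positive definite, its eigenvalues
are positive and

$$0 < \mu < \frac{1}{\lambda_{\max}} \tag{10.3.18}$$

provides a necessary and sufficient condition for the convergence of SDA.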
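
The step-size condition can be checked numerically by computing the eigenvalues of a correlation matrix and the factors $1 - 2\mu\lambda_i$ that multiply each mode at every iteration. The sketch below does this for a hypothetical real symmetric 2 x 2 matrix; the closed-form eigenvalue formula applies only to that case, and the function names are introduced here for illustration.

```python
import math


def eigenvalues_sym_2x2(R):
    """Eigenvalues (largest first) of a real symmetric 2x2 matrix."""
    a, b, c = R[0][0], R[0][1], R[1][1]
    center = 0.5 * (a + c)
    radius = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    return center + radius, center - radius


def mode_factors(R, mu):
    """Per-iteration factors (1 - 2*mu*lambda_i) of the decoupled modes."""
    return [1.0 - 2.0 * mu * lam for lam in eigenvalues_sym_2x2(R)]


def is_stable(R, mu):
    """True if every mode decays, that is, |1 - 2*mu*lambda_i| < 1 for all i."""
    return all(abs(f) < 1.0 for f in mode_factors(R, mu))


# Hypothetical correlation matrix with eigenvalues 1.5 and 0.5.
R = [[1.0, 0.5],
     [0.5, 1.0]]
lam_max, lam_min = eigenvalues_sym_2x2(R)
mu_bound = 1.0 / lam_max

for mu in (0.6, 0.7):
    print(mu, mode_factors(R, mu), "stable" if is_stable(R, mu) else "unstable")
```

With these numbers the bound is 1/1.5, so a step size of 0.6 keeps both factors inside the unit interval while 0.7 pushes the factor of the largest eigenvalue below -1, and that mode grows in magnitude with alternating sign.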
To investigate the transient behavior of the SDA as a function of k, we note that using
(10.3.10), (10.3.11), and (10.3.14), we have

$$c_{k,i} = c_{o,i} + \sum_{j=1}^{M} q_{ij}\, \tilde{c}'_{0,j}\, (1 - 2\mu\lambda_j)^k \tag{10.3.19}$$

where $c_{o,i}$ are the optimum coefficients and $q_{ij}$ the elements of the eigenvector matrix $\mathbf{Q}$.
The MSE at step k is

$$P_k = P_o + \sum_{i=1}^{M} \lambda_i\, (1 - 2\mu\lambda_i)^{2k}\, |\tilde{c}'_{0,i}|^2 \tag{10.3.20}$$

and can be obtained by substituting (10.3.19) in (10.3.1). If $\mu$ satisfies (10.3.18), we have
$\lim_{k \to \infty} P_k = P_o$ and the MSE converges exponentially to the optimum value. The curve
obtained by plotting the MSE $P_k$ as a function of the number of iterations k is known as the
learning curve.


Rate of convergence 


The rate (or speed) of convergence depends upon the algorithm and the nature of the
performance surface. The most influential factor is the condition number of the
Hessian matrix, which determines the shape of the contours of $P(\mathbf{c})$. When $P(\mathbf{c})$ is quadratic,
it can be shown (Luenberger 1984) that

$$P(\mathbf{c}_k) \le \left( \frac{\mathcal{X}(\mathbf{R}) - 1}{\mathcal{X}(\mathbf{R}) + 1} \right)^2 P(\mathbf{c}_{k-1}) \tag{10.3.21}$$

where $\mathcal{X}(\mathbf{R}) = \lambda_{\max}/\lambda_{\min}$ is the condition number of $\mathbf{R}$. If we recall that the eigenvectors
corresponding to $\lambda_{\min}$ and $\lambda_{\max}$ point to the directions of minimum and maximum curvature,
respectively, we see that the convergence slows down as the contours become more eccentric
(flattened). For circular contours, that is, when $\mathcal{X}(\mathbf{R}) = 1$, the algorithm converges in one
step. We stress that even if M − 1 eigenvalues of $\mathbf{R}$ are equal and the remaining one is
far away, the convergence of the SDA is still very slow.

The rate of convergence can be characterized by using the time constant $\tau_i$ defined by

$$1 - 2\mu\lambda_i = \exp\left( -\frac{1}{\tau_i} \right) \quad \Rightarrow \quad \tau_i = \frac{-1}{\ln(1 - 2\mu\lambda_i)} \tag{10.3.22}$$

which provides the time (or number of iterations) it takes for the ith mode $\tilde{c}'_{k,i}$ in (10.3.19)
to decay to 1/e of its initial value $\tilde{c}'_{0,i}$. When $\mu \ll 1$, we obtain

$$\tau_i \approx \frac{1}{2\mu\lambda_i} \tag{10.3.23}$$

In a similar fashion, the time constant $\tau_{i,\mathrm{mse}}$ for the MSE $P_k$ can be shown to be

$$\tau_{i,\mathrm{mse}} \approx \frac{1}{4\mu\lambda_i} \tag{10.3.24}$$

by using (10.3.20) and (10.3.22).

Thus, for all practical purposes, the time constant (for coefficient $\mathbf{c}_k$ or for MSE $P_k$)
of the SDA is $\tau \approx 1/(\mu\lambda_{\min})$, which in conjunction with $\mu < 1/\lambda_{\max}$ results in
$\tau > \lambda_{\max}/\lambda_{\min}$. Hence, the larger the eigenvalue spread of the input correlation matrix $\mathbf{R}$, the
longer it takes for the SDA to converge.
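
The time constants in (10.3.22) through (10.3.24) are easy to evaluate numerically. The sketch below compares the exact value with the small-step approximations for two illustrative eigenvalues and one step size; these numbers and the function name are assumptions chosen for the example.

```python
import math


def time_constants(mu, lam):
    """Return (exact tau, approximate tau, approximate MSE tau) for one mode."""
    x = 2.0 * mu * lam
    if not 0.0 < x < 1.0:
        raise ValueError("need 0 < 2*mu*lambda < 1 for a monotonically decaying mode")
    exact = -1.0 / math.log(1.0 - x)   # from 1 - 2*mu*lambda = exp(-1/tau)
    approx = 1.0 / x                   # small-step approximation 1/(2*mu*lambda)
    approx_mse = 1.0 / (2.0 * x)       # MSE time constant 1/(4*mu*lambda)
    return exact, approx, approx_mse


mu = 0.15
for lam in (1.818, 0.182):             # illustrative eigenvalues
    exact, approx, approx_mse = time_constants(mu, lam)
    print(f"lambda={lam}: tau={exact:.2f}, approx={approx:.2f}, mse approx={approx_mse:.2f}")

slow_exact, slow_approx, slow_mse = time_constants(mu, 0.182)
```

The mode belonging to the smallest eigenvalue has by far the largest time constant, and the approximation is closest for that mode because $2\mu\lambda_i$ is smallest there.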

In the following example, we illustrate the properties of the SDA discussed above by using
it to compute the parameters of a second-order forward linear predictor.


EXAMPLE 10.3.1. Consider a signal generated by the second-order autoregressive AR(2) process

$$x(n) + a_1 x(n-1) + a_2 x(n-2) = w(n) \tag{10.3.25}$$

where $w(n) \sim \mathrm{WGN}(0, \sigma_w^2)$. Parameters $a_1$ and $a_2$ are chosen so that the system (10.3.25) is
minimum-phase. We want to design an adaptive filter that uses the samples $x(n-1)$ and $x(n-2)$
to predict the value $x(n)$ (desired response).

If we multiply (10.3.25) by $x(n-k)$, for k = 0, 1, 2, and take the mathematical expectation
of both sides, we obtain a set of linear equations

$$r(0) + a_1 r(1) + a_2 r(2) = \sigma_w^2 \tag{10.3.26}$$

$$r(1) + a_1 r(0) + a_2 r(1) = 0 \tag{10.3.27}$$

$$r(2) + a_1 r(1) + a_2 r(0) = 0 \tag{10.3.28}$$

which can be used to express the autocorrelation of $x(n)$ in terms of model parameters $a_1$, $a_2$,
and $\sigma_w^2$. Indeed, solving (10.3.26) through (10.3.28), we obtain

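$$r(0) = \sigma_x^2 = \frac{1 + a_2}{1 - a_2}\, \frac{\sigma_w^2}{(1 + a_2)^2 - a_1^2} \qquad r(1) = \frac{-a_1}{1 + a_2}\, r(0) \qquad r(2) = \left( -a_2 + \frac{a_1^2}{1 + a_2} \right) r(0) \tag{10.3.29}$$

We choose $\sigma_x^2 = 1$, so that

$$\sigma_w^2 = \frac{(1 - a_2)\,[(1 + a_2)^2 - a_1^2]}{1 + a_2} \tag{10.3.30}$$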
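The coefficients of the optimum predictor

$$\hat{y}(n) = \hat{x}(n) = c_{o,1}\, x(n-1) + c_{o,2}\, x(n-2) \tag{10.3.31}$$

are given by (see Section 6.5)

$$r(0)\, c_{o,1} + r(1)\, c_{o,2} = r(1) \tag{10.3.32}$$

$$r(1)\, c_{o,1} + r(0)\, c_{o,2} = r(2) \tag{10.3.33}$$

with

$$P_o^f = r(0) - r(1)\, c_{o,1} - r(2)\, c_{o,2} \tag{10.3.34}$$

whose comparison with (10.3.26) through (10.3.28) shows that $c_{o,1} = -a_1$, $c_{o,2} = -a_2$, and
$P_o^f = \sigma_w^2$, as expected.
The eigenvalues of the input correlation matrix

$$\mathbf{R} = \begin{bmatrix} r(0) & r(1) \\ r(1) & r(0) \end{bmatrix} \tag{10.3.35}$$

are

$$\lambda_{1,2} = \left( 1 \mp \frac{a_1}{1 + a_2} \right) \sigma_x^2 \tag{10.3.36}$$

from which the eigenvalue spread is

$$\mathcal{X}(\mathbf{R}) = \frac{\lambda_1}{\lambda_2} = \frac{1 - a_1 + a_2}{1 + a_1 + a_2} \tag{10.3.37}$$

which, if $a_2 > 0$ and $a_1 < 0$, is larger than 1.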
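
The closed-form expressions of this example can be evaluated directly. The sketch below computes the autocorrelation values, the eigenvalues, the eigenvalue spread, and the noise variance from the AR(2) parameters, with the signal variance normalized to 1. The function name and the returned dictionary are conveniences introduced here; the two parameter pairs are the ones used in the experiments of this example.

```python
def ar2_statistics(a1, a2, var_x=1.0):
    """Second-order statistics of the AR(2) process for a given signal variance."""
    r0 = var_x
    r1 = -a1 / (1.0 + a2) * r0                      # from (10.3.27)
    r2 = (-a2 + a1 * a1 / (1.0 + a2)) * r0          # from (10.3.28)
    lam1 = (1.0 - a1 / (1.0 + a2)) * var_x          # eigenvalues of [[r0, r1], [r1, r0]]
    lam2 = (1.0 + a1 / (1.0 + a2)) * var_x
    var_w = r0 + a1 * r1 + a2 * r2                  # from (10.3.26)
    return {"r": (r0, r1, r2), "lam1": lam1, "lam2": lam2,
            "spread": lam1 / lam2, "var_w": var_w}


small = ar2_statistics(-0.1950, 0.95)
large = ar2_statistics(-1.5955, 0.95)

for name, stats in (("small", small), ("large", large)):
    print(name, {k: v for k, v in stats.items() if k != "r"})
```

Since the optimum predictor has $c_{o,1} = -a_1$ and $c_{o,2} = -a_2$, the value `var_w` is also the minimum prediction error of (10.3.34).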
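Now we perform MATLAB experiments with varying eigenvalue spread $\mathcal{X}(\mathbf{R})$ and step-size
parameter $\mu$. In these experiments, we choose $\sigma_w^2$ so that $\sigma_x^2 = 1$. The SDA is given by

$$\mathbf{c}_k \triangleq [c_{k,1} \;\; c_{k,2}]^T = \mathbf{c}_{k-1} + 2\mu(\mathbf{d} - \mathbf{R}\mathbf{c}_{k-1})$$

where

$$\mathbf{d} = [r(1) \;\; r(2)]^T \qquad \text{and} \qquad \mathbf{c}_0 = [0 \;\; 0]^T$$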


We choose two different sets of values for $a_1$ and $a_2$, one for a small and the other for a
large eigenvalue spread. These values are shown in Table 10.2 along with the corresponding
eigenvalue spread $\mathcal{X}(\mathbf{R})$ and the MMSE $\sigma_w^2$.


TABLE 10.2
Parameter values used in the SDA for the second-order forward
prediction problem.

Eigenvalue spread    a_1        a_2     λ_1      λ_2      X(R)    σ_w²
Small                -0.1950    0.95    1.1      0.9      1.22    0.0965
Large                -1.5955    0.95    1.818    0.182    9.99    0.0322


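Using each set of parameter values, the SDA is implemented starting with the null coefficient
vector $\mathbf{c}_0$ with two values of step-size parameters. To describe the transient behavior of the
algorithm, it is informative to plot the trajectory of $c_{k,1}$ versus $c_{k,2}$ as a function of the iteration
index k along with the contours of the error surface $P(\mathbf{c}_k)$. The trajectory of $\mathbf{c}_k$ begins at the
origin $\mathbf{c}_0 = \mathbf{0}$ and ends at the optimum value $\mathbf{c}_o = -[a_1 \;\; a_2]^T$. This illustration of the transient
behavior can also be obtained in the domain of the transformed error coefficients $\tilde{\mathbf{c}}'_k$. Using
(10.3.15), we see these coefficients are given by

$$\tilde{\mathbf{c}}'_k = \begin{bmatrix} \tilde{c}'_{k,1} \\ \tilde{c}'_{k,2} \end{bmatrix} = \begin{bmatrix} (1 - 2\mu\lambda_1)^k\, \tilde{c}'_{0,1} \\ (1 - 2\mu\lambda_2)^k\, \tilde{c}'_{0,2} \end{bmatrix} \tag{10.3.38}$$

where $\tilde{\mathbf{c}}'_0$ from (10.3.10) and (10.3.13) is given by

$$\tilde{\mathbf{c}}'_0 = \begin{bmatrix} \tilde{c}'_{0,1} \\ \tilde{c}'_{0,2} \end{bmatrix} = \mathbf{Q}^T\tilde{\mathbf{c}}_0 = \mathbf{Q}^T(\mathbf{c}_0 - \mathbf{c}_o) = -\mathbf{Q}^T\mathbf{c}_o = \mathbf{Q}^T \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \tag{10.3.39}$$

Thus the trajectory of $\tilde{\mathbf{c}}'_k$ begins at $\tilde{\mathbf{c}}'_0$ and ends at the origin $\tilde{\mathbf{c}}'_k = \mathbf{0}$. The contours of the MSE
function in the transformed domain are given by $P_k - P_o$. From (10.3.20), these contours are
given by

$$P_k - P_o = \sum_{i=1}^{2} \lambda_i\, (\tilde{c}'_{k,i})^2 = \lambda_1 (\tilde{c}'_{k,1})^2 + \lambda_2 (\tilde{c}'_{k,2})^2 \tag{10.3.40}$$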
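
The experiment setup above can be sketched in a few lines. The code below is a stand-in written in Python rather than the MATLAB used for the experiments in the text. It runs the SDA for both parameter sets of Table 10.2 with $\mu = 0.15$ and counts the iterations needed until the coefficient error norm falls to 1 percent of its initial value. The 1 percent tolerance and the function name are assumptions made for this illustration, so the counts depend on that choice.

```python
import math


def sda_iterations(a1, a2, mu, tol=0.01, max_iter=10000):
    """Count SDA iterations until ||c_k - c_o|| <= tol * ||c_0 - c_o||.

    The tolerance `tol` is an illustrative choice, not taken from the text.
    """
    # Autocorrelation for sigma_x^2 = r(0) = 1, from (10.3.27) and (10.3.28).
    r1 = -a1 / (1.0 + a2)
    r2 = -a2 + a1 * a1 / (1.0 + a2)
    R = [[1.0, r1], [r1, 1.0]]
    d = [r1, r2]
    c_o = [-a1, -a2]                      # optimum predictor coefficients
    c = [0.0, 0.0]                        # null initial vector c_0
    initial_error = math.hypot(c_o[0], c_o[1])
    for k in range(1, max_iter + 1):
        Rc = [R[i][0] * c[0] + R[i][1] * c[1] for i in range(2)]
        c = [c[i] + 2.0 * mu * (d[i] - Rc[i]) for i in range(2)]
        error = math.hypot(c[0] - c_o[0], c[1] - c_o[1])
        if error <= tol * initial_error:
            return k
    raise RuntimeError("SDA did not converge within max_iter iterations")


steps_small = sda_iterations(-0.1950, 0.95, mu=0.15)   # small eigenvalue spread
steps_large = sda_iterations(-1.5955, 0.95, mu=0.15)   # large eigenvalue spread
print("iterations (small spread):", steps_small)
print("iterations (large spread):", steps_large)
```

With the same step size, the run with the larger eigenvalue spread needs several times as many iterations, because its slowest mode has a factor $1 - 2\mu\lambda_{\min}$ much closer to 1.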
Small eigenvalue spread and overdamped response. For this experiment, the parameter
values were selected to obtain an eigenvalue spread approximately equal to 1 [$\mathcal{X}(\mathbf{R}) = 1.22$]. The
step size selected was $\mu = 0.15$, which is less than $1/\lambda_{\max} = 1/1.1 \approx 0.91$, as required for
convergence. For this value of $\mu$, the transient response is overdamped. Figure 10.14 shows four
graphs indicating the behavior of the algorithm. In graph (a), the trajectory of $\tilde{\mathbf{c}}'_k$ is shown for
$0 \le k \le 15$ along with the corresponding loci of $\tilde{\mathbf{c}}'_k$ for a fixed value of $P_k - P_o$. The first two
loci for k = 0 and 1 are numbered to show the direction of the trajectory. Graph (b) shows the
corresponding trajectory and the contours for $\mathbf{c}_k$. Graph (c) shows plots of $c_{k,1}$ and $c_{k,2}$ as a
function of iteration step k, while graph (d) shows a similar learning curve for the MSE $P_k$.
Several observations can be made about these plots. The contours of constant MSE in the $\tilde{\mathbf{c}}'_k$
domain are almost circular since the spread is approximately 1, while those in the $\mathbf{c}_k$ domain
are somewhat elliptical, which is to be expected. The trajectories of $\tilde{\mathbf{c}}'_k$ and $\mathbf{c}_k$ as a function
of k are normal to the contours. The coefficients converge to their optimum values in a
monotonic fashion, which confirms the overdamped nature of the response. Also this convergence
is rapid, in about 15 steps, which is to be expected for a small eigenvalue spread.

FIGURE 10.14
Performance curves for the steepest-descent algorithm used in the linear prediction problem
with step-size parameter $\mu = 0.15$ and eigenvalue spread $\mathcal{X}(\mathbf{R}) = 1.22$.


Large eigenvalue spread and overdamped response. For this experiment, the parameter values were selected so that the eigenvalue spread was approximately equal to 10 [$\mathcal{X}(\mathbf{R}) = 9.99$]. The step size was again selected as $\mu = 0.15$. Figure 10.15 shows the performance plots for this experiment, which are similar to those of Figure 10.14. The observations are also similar except for those due to the larger spread. First, the contours, even in the transformed domain, are elliptical; second, the convergence is slow, requiring about 60 steps in the algorithm. The transient response is once again overdamped.

FIGURE 10.15
Performance curves for the steepest-descent algorithm used in the linear prediction problem with step-size parameter $\mu = 0.15$ and eigenvalue spread $\mathcal{X}(\mathbf{R}) = 10$.


Large eigenvalue spread and underdamped response. Finally, in the third experiment, we consider the model parameters of the above case and increase the step size to $\mu = 0.5$ ($< 1/\lambda_{\max} = 0.55$) so that the transient response is underdamped. Figure 10.16 shows the corresponding plots. Note that the coefficients converge in an oscillatory fashion; however, the convergence is fairly rapid compared to that of the overdamped case. Thus the selection of the step size is an important design issue.

FIGURE 10.16
Performance curves for the steepest-descent algorithm used in the linear prediction problem with eigenvalue spread $\mathcal{X}(\mathbf{R}) = 10$ and varying step-size parameters $\mu = 0.15$ and $\mu = 0.5$.
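The damping behavior in the three experiments can be read directly from the factors $1 - 2\mu\lambda_i$ in (10.3.38): a factor between 0 and 1 decays monotonically, while a factor between $-1$ and 0 changes sign at every iteration. The short Python sketch below makes this check explicit; the helper names `sda_modes` and `describe` are hypothetical and are used only for illustration.

```python
def sda_modes(eigenvalues, mu):
    """Return the per-mode factors 1 - 2*mu*lambda_i of the SDA error recursion."""
    return [1.0 - 2.0 * mu * lam for lam in eigenvalues]


def describe(eigenvalues, mu):
    """Classify the transient response from the signs and sizes of the mode factors."""
    modes = sda_modes(eigenvalues, mu)
    if any(abs(m) >= 1.0 for m in modes):
        return "unstable"
    # A negative factor alternates in sign, which produces an oscillatory response.
    return "underdamped" if any(m < 0.0 for m in modes) else "overdamped"


print(describe([1.1, 0.9], 0.15))      # overdamped
print(describe([1.818, 0.182], 0.15))  # overdamped
print(describe([1.818, 0.182], 0.5))   # underdamped
```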


Newton’s type of algorithms 


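Another family of algorithms with a faster rate of convergence includes Newton's method and its modifications. The basic idea of Newton's method is to achieve convergence in one step when $P(\mathbf{c})$ is quadratic. Thus, if $\mathbf{c}_k$ is to be the minimum of $P(\mathbf{c})$, the gradient $\nabla P(\mathbf{c}_k)$ of $P(\mathbf{c})$ evaluated at $\mathbf{c}_k$ should be zero. From (10.2.19), we can write

$$\nabla P(\mathbf{c}_k) = \nabla P(\mathbf{c}_{k-1}) + \nabla^2 P(\mathbf{c}_{k-1})\,\Delta\mathbf{c}_k = \mathbf{0} \tag{10.3.41}$$

Thus $\nabla P(\mathbf{c}_k) = \mathbf{0}$ leads to the step increment

$$\Delta\mathbf{c}_k = -[\nabla^2 P(\mathbf{c}_{k-1})]^{-1}\,\nabla P(\mathbf{c}_{k-1}) \tag{10.3.42}$$

and hence the adaptive algorithm is given by

$$\mathbf{c}_k = \mathbf{c}_{k-1} - \mu[\nabla^2 P(\mathbf{c}_{k-1})]^{-1}\,\nabla P(\mathbf{c}_{k-1}) \tag{10.3.43}$$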


where $\mu > 0$ is the step size. For quadratic error surfaces, from (10.3.4) and (10.3.5), we obtain with $\mu = 1$

$$\mathbf{c}_k = \mathbf{c}_{k-1} - [\nabla^2 P(\mathbf{c}_{k-1})]^{-1}\,\nabla P(\mathbf{c}_{k-1}) = \mathbf{c}_{k-1} - (\mathbf{c}_{k-1} - \mathbf{R}^{-1}\mathbf{d}) = \mathbf{c}_o \tag{10.3.44}$$

which shows that indeed the algorithm converges in one step.

For the quadratic case, since $\nabla^2 P(\mathbf{c}_{k-1}) = 2\mathbf{R}$ from (10.3.1), we can express Newton's algorithm as

$$\mathbf{c}_k = \mathbf{c}_{k-1} - \mu\mathbf{R}^{-1}\,\nabla P(\mathbf{c}_{k-1}) \tag{10.3.45}$$

where $\mu$ is the step size that regulates the convergence rate. Other modified Newton methods replace the Hessian matrix $\nabla^2 P(\mathbf{c}_{k-1})$ with another matrix, which is guaranteed to be positive definite and, in some way, close to the Hessian. These Newton-type algorithms generally provide faster convergence. However, in practice, the inversion of $\mathbf{R}$ is numerically intensive and can lead to a numerically unstable solution if special care is not taken. Therefore, the SDA is more popular in adaptive filtering applications.

When the function P(c) is nonquadratic, it is approximated locally by a quadratic 
function that is minimized exactly. However, the step obtained in (10.3.42) does not lead to 
the minimum of P(c), and the iteration should be repeated several times. A more detailed 
treatment of linear and nonlinear optimization techniques can be found in Scales (1985) 
and in Luenberger (1984). 
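The one-step convergence in (10.3.44) is easy to confirm for a two-coefficient quadratic surface. The sketch below is illustrative only: the function `newton_step` is a hypothetical helper, the Hessian $2\mathbf{R}$ is inverted in closed form for the $2 \times 2$ case, and the numerical values reuse the large-spread predictor of Table 10.2 under the assumed AR(2) relations $r(1) = -a_1/(1+a_2)$ and $r(2) = -a_1 r(1) - a_2$.

```python
def newton_step(R, d, c, mu=1.0):
    """One update c - mu * [2R]^{-1} * 2(Rc - d) for a 2x2 quadratic error surface."""
    grad = [2.0 * (R[i][0] * c[0] + R[i][1] * c[1] - d[i]) for i in range(2)]
    # Closed-form inverse of the 2x2 Hessian H = 2R.
    det = 4.0 * (R[0][0] * R[1][1] - R[0][1] * R[1][0])
    h_inv = [[2.0 * R[1][1] / det, -2.0 * R[0][1] / det],
             [-2.0 * R[1][0] / det, 2.0 * R[0][0] / det]]
    step = [h_inv[i][0] * grad[0] + h_inv[i][1] * grad[1] for i in range(2)]
    return [c[i] - mu * step[i] for i in range(2)]


a1, a2 = -1.5955, 0.95
r1 = -a1 / (1.0 + a2)          # assumed AR(2) autocorrelation with r(0) = 1
r2 = -a1 * r1 - a2
R = [[1.0, r1], [r1, 1.0]]
d = [r1, r2]

c1 = newton_step(R, d, [0.0, 0.0])   # a single step from the origin with mu = 1
print(c1)                            # close to [1.5955, -0.95]
```

With $\mu < 1$ the same function takes a damped step, so repeated calls approach the optimum geometrically instead of reaching it at once.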


10.4 LEAST-MEAN-SQUARE ADAPTIVE FILTERS 


In this section, we derive the least-mean-square (LMS) adaptive algorithm, analyze its performance, and present some practical applications. The LMS algorithm, introduced by Widrow and Hoff (1960), is widely used in practice due to its simplicity, computational efficiency, and good performance under a variety of operating conditions.


10.4.1 Derivation 


We first present two approaches to the derivation of the LMS algorithm that will help the reader to understand its operation. The first approach uses an approximation to the gradient function, while the second approach uses geometric arguments.


Optimization approach. The SDA uses the second-order moments $\mathbf{R}$ and $\mathbf{d}$ to iteratively compute the optimum filter $\mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d}$, starting with an initial guess, usually $\mathbf{c}_0 = \mathbf{0}$, and then obtaining better approximations by taking steps in the direction of the negative gradient, that is,

$$\mathbf{c}_k = \mathbf{c}_{k-1} + \mu[-\nabla P(\mathbf{c}_{k-1})] \tag{10.4.1}$$

where

$$\nabla P(\mathbf{c}_{k-1}) = 2(\mathbf{R}\mathbf{c}_{k-1} - \mathbf{d}) \tag{10.4.2}$$

is the gradient of the performance function (10.3.1). In practice, where only the input $\{\mathbf{x}(j)\}_0^n$ and the desired response $\{y(j)\}_0^n$ are known, we can only compute an estimate of the "true" or exact gradient (10.4.2) using the available data. To develop an adaptive algorithm from (10.4.1), we take the following steps: (1) replace the iteration subscript $k$ by the time index $n$; and (2) replace $\mathbf{R}$ and $\mathbf{d}$ by their instantaneous estimates $\mathbf{x}(n)\mathbf{x}^H(n)$ and $\mathbf{x}(n)y^*(n)$, respectively. The instantaneous estimate of the gradient (10.4.2) becomes

$$\nabla P(\mathbf{c}_{n-1}) = 2\mathbf{R}\mathbf{c}_{n-1} - 2\mathbf{d} \simeq 2\mathbf{x}(n)\mathbf{x}^H(n)\mathbf{c}(n-1) - 2\mathbf{x}(n)y^*(n) = -2\mathbf{x}(n)e^*(n) \tag{10.4.3}$$

where

$$e(n) = y(n) - \mathbf{c}^H(n-1)\mathbf{x}(n) \tag{10.4.4}$$

is the a priori filtering error. The estimate (10.4.3) also can be obtained by starting with the approximation $P(\mathbf{c}) \simeq |e(n)|^2$ and taking its gradient. The coefficient adaptation algorithm is

$$\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu\mathbf{x}(n)e^*(n) \tag{10.4.5}$$

which is obtained by substituting (10.4.3) and (10.4.4) in (10.4.1). The step-size parameter $2\mu$ is also known as the adaptation gain.

The LMS algorithm, specified by (10.4.5) and (10.4.4), has both important similarities to and important differences from the SDA (10.3.7). The SDA contains deterministic quantities while the LMS operates on random quantities. The SDA is not an adaptive algorithm because it only depends on the second-order moments $\mathbf{R}$ and $\mathbf{d}$ and not on the SOE $\{\mathbf{x}(n, \zeta), y(n, \zeta)\}$. Also, the iteration index $k$ has nothing to do with time. Simply stated, the SDA provides an iterative solution to the linear system $\mathbf{R}\mathbf{c} = \mathbf{d}$.


Geometric approach. Suppose that an adaptive filter operates in a stationary signal environment seeking the optimum filter $\mathbf{c}_o$. At time $n$ the filter has access to input vector $\mathbf{x}(n)$, the desired response $y(n)$, and the previous or old coefficient estimate $\mathbf{c}(n-1)$. Its goal is to use this information to determine a new estimate $\mathbf{c}(n)$ that is closer to the optimum vector $\mathbf{c}_o$, or equivalently to choose $\mathbf{c}(n)$ so that $\|\tilde{\mathbf{c}}(n)\| < \|\tilde{\mathbf{c}}(n-1)\|$, where $\tilde{\mathbf{c}}(n) = \mathbf{c}(n) - \mathbf{c}_o$ is the coefficient error vector given by (10.2.24). Eventually, we want $\|\tilde{\mathbf{c}}(n)\|$ to become negligible as $n \to \infty$.

The vector $\tilde{\mathbf{c}}(n-1)$ can be decomposed into two orthogonal components

$$\tilde{\mathbf{c}}(n-1) = \tilde{\mathbf{c}}_x(n-1) + \tilde{\mathbf{c}}_x^{\perp}(n-1) \tag{10.4.6}$$

one parallel and one orthogonal to the input vector $\mathbf{x}(n)$, as shown in Figure 10.17(a). The response of the error filter $\tilde{\mathbf{c}}(n-1)$ to the input $\mathbf{x}(n)$ is

$$\tilde{y}(n) = \tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n) = \tilde{\mathbf{c}}_x^H(n-1)\mathbf{x}(n) \tag{10.4.7}$$

which implies that

$$\tilde{\mathbf{c}}_x(n-1) = \frac{\tilde{y}^*(n)}{\|\mathbf{x}(n)\|^2}\,\mathbf{x}(n) \tag{10.4.8}$$

which can be verified by direct substitution in (10.4.7). Note that $\mathbf{x}(n)/\|\mathbf{x}(n)\|$ is a unit vector along the direction of $\mathbf{x}(n)$.


FIGURE 10.17
The geometric approach for the derivation of the LMS algorithm [panels (a) and (b)].


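If we only know $\mathbf{x}(n)$ and $\tilde{y}(n)$, the best strategy to decrease $\|\tilde{\mathbf{c}}(n)\|$ is to choose $\tilde{\mathbf{c}}(n) = \tilde{\mathbf{c}}_x^{\perp}(n-1)$, or equivalently subtract $\tilde{\mathbf{c}}_x(n-1)$ from $\tilde{\mathbf{c}}(n-1)$. From Figure 10.17(a) note that as long as $\tilde{\mathbf{c}}_x(n-1) \neq \mathbf{0}$, $\|\tilde{\mathbf{c}}(n)\| = \|\tilde{\mathbf{c}}_x^{\perp}(n-1)\| < \|\tilde{\mathbf{c}}(n-1)\|$. This suggests the following adaptation algorithm

$$\tilde{\mathbf{c}}(n) = \tilde{\mathbf{c}}(n-1) - \tilde{\mu}\,\frac{\tilde{y}^*(n)}{\|\mathbf{x}(n)\|^2}\,\mathbf{x}(n) \tag{10.4.9}$$

which guarantees that $\|\tilde{\mathbf{c}}(n)\| < \|\tilde{\mathbf{c}}(n-1)\|$ as long as $0 < \tilde{\mu} < 2$ and $\tilde{y}(n) \neq 0$, as shown in Figure 10.17(b). The best choice clearly is $\tilde{\mu} = 1$.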

Unfortunately, the signal $\tilde{y}(n)$ is not available, and we have to replace it with some reasonable approximation. From (10.2.18) and (10.2.10) we obtain

$$\begin{aligned} \tilde{e}(n) &\triangleq e(n) - e_o(n) = y(n) - \hat{y}(n) - y(n) + \hat{y}_o(n) = \hat{y}_o(n) - \hat{y}(n) \\ &= [\mathbf{c}_o^H - \mathbf{c}^H(n-1)]\mathbf{x}(n) = -\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n) = -\tilde{y}(n) \end{aligned} \tag{10.4.10}$$

where we have used (10.4.7). Using the approximation

$$\tilde{e}(n) = e(n) - e_o(n) \simeq e(n)$$

we combine it with (10.4.10) to get

$$\mathbf{c}(n) = \mathbf{c}(n-1) + \tilde{\mu}\,\frac{e^*(n)}{\|\mathbf{x}(n)\|^2}\,\mathbf{x}(n) \tag{10.4.11}$$

which is known as the normalized LMS algorithm. Note that the effective step size $\tilde{\mu}/\|\mathbf{x}(n)\|^2$ is time-varying. The LMS algorithm in (10.4.5) follows if we set $\|\mathbf{x}(n)\| = 1$ and choose $\tilde{\mu} = 2\mu$.
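To make the geometric argument concrete, the following Python sketch applies a single normalized LMS step (10.4.11) to real-valued data and checks that the coefficient error norm shrinks. It is only an illustration: `nlms_update` is a hypothetical helper name, the data are noise-free so that $e_o(n) = 0$, and the small constant `eps` in the denominator is an added safeguard against $\mathbf{x}(n) = \mathbf{0}$ that is not part of (10.4.11).

```python
import math


def nlms_update(c, x, y, mu_tilde, eps=1e-12):
    """One normalized LMS step for real-valued data; returns (new_c, a_priori_error)."""
    y_hat = sum(ci * xi for ci, xi in zip(c, x))
    e = y - y_hat                                  # a priori error e(n)
    energy = sum(xi * xi for xi in x) + eps        # ||x(n)||^2 plus the added safeguard
    c_new = [ci + mu_tilde * e * xi / energy for ci, xi in zip(c, x)]
    return c_new, e


def error_norm(c, c_opt):
    """Euclidean norm of the coefficient error vector c - c_opt."""
    return math.sqrt(sum((ci - oi) ** 2 for ci, oi in zip(c, c_opt)))


c_opt = [1.0, 1.0]                  # assumed optimum filter for this toy example
x = [1.0, 2.0]
y = sum(oi * xi for oi, xi in zip(c_opt, x))       # noise-free desired response
c_old = [0.0, 0.0]
c_new, e = nlms_update(c_old, x, y, mu_tilde=1.0)
print(error_norm(c_old, c_opt), error_norm(c_new, c_opt))
```

With $\tilde{\mu} = 1$ the component of the error along $\mathbf{x}(n)$ is removed completely, so the updated filter reproduces $y(n)$ for this input vector.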
LMS algorithm. The LMS algorithm can be summarized as

$$\begin{aligned} \hat{y}(n) &= \mathbf{c}^H(n-1)\mathbf{x}(n) && \text{filtering} \\ e(n) &= y(n) - \hat{y}(n) && \text{error formation} \\ \mathbf{c}(n) &= \mathbf{c}(n-1) + 2\mu\mathbf{x}(n)e^*(n) && \text{coefficient updating} \end{aligned} \tag{10.4.12}$$

where $\mu$ is the adaptation step size. The algorithm requires $2M + 1$ complex multiplications and $2M$ complex additions. Figure 10.18 shows an implementation of an FIR adaptive filter using the LMS algorithm, which is implemented in MATLAB using the function [yhat,c]=firlms(x,y,M,mu). The a posteriori form of the LMS algorithm is developed in Problem 10.9.
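The three steps in (10.4.12) translate almost line by line into code. The sketch below is a real-valued Python analog written for illustration; it is not the MATLAB function mentioned above, and the name `fir_lms`, the zero initial filter state, and the system-identification test signal are all assumptions made here.

```python
import random


def fir_lms(x, y, M, mu):
    """Real-valued LMS adaptive FIR filter following the three steps of (10.4.12).

    Returns (y_hat, c): the list of filter outputs and the final coefficient vector.
    """
    c = [0.0] * M                 # c(-1) = 0
    taps = [0.0] * M              # x(n) = [x(n), x(n-1), ..., x(n-M+1)], zero initial state
    y_hat = []
    for xn, yn in zip(x, y):
        taps = [xn] + taps[:-1]
        out = sum(ck * xk for ck, xk in zip(c, taps))               # filtering
        err = yn - out                                              # error formation
        c = [ck + 2.0 * mu * err * xk for ck, xk in zip(c, taps)]   # coefficient updating
        y_hat.append(out)
    return y_hat, c


# Hypothetical system-identification demo: y(n) is the noise-free output of a
# known 3-tap FIR system driven by white Gaussian noise of unit variance.
rng = random.Random(0)
c_true = [0.5, -0.3, 0.1]
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y = [sum(c_true[k] * x[n - k] for k in range(3) if n - k >= 0) for n in range(len(x))]
y_hat, c = fir_lms(x, y, M=3, mu=0.01)
print([round(ck, 3) for ck in c])   # close to c_true
```

Because the desired response here contains no noise, the coefficients settle on the true system; with additive noise they would keep fluctuating around it.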


FIGURE 10.18 
An FIR adaptive filter realization using the LMS algorithm. 


10.4.2 Adaptation in a Stationary SOE 


In the sequel, we study the stability and steady-state performance of the LMS algorithm in a stationary SOE; that is, we assume that the input and the desired response processes are jointly stationary. In theory, the goal of the LMS adaptive filter is to identify the optimum filter $\mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d}$ from observations of the input $\mathbf{x}(n)$ and the desired response

$$y(n) = \mathbf{c}_o^H\mathbf{x}(n) + e_o(n) \tag{10.4.13}$$

The optimum error $e_o(n)$ is orthogonal to the vector $\mathbf{x}(n)$; that is, $E\{\mathbf{x}(n)e_o^*(n)\} = \mathbf{0}$, and it acts as measurement or output noise, as shown in Figure 10.19.


FIGURE 10.19
LMS algorithm in a stationary SOE.

The first step in the statistical analysis of the LMS algorithm is to determine a difference equation for the coefficient error vector $\tilde{\mathbf{c}}(n)$. To this end, we subtract $\mathbf{c}_o$ from both sides of (10.4.5), to obtain

$$\tilde{\mathbf{c}}(n) = \tilde{\mathbf{c}}(n-1) + 2\mu\mathbf{x}(n)e^*(n) \tag{10.4.14}$$

which expresses the LMS algorithm in terms of the coefficient error vector. We next use (10.4.12) and (10.4.13) in (10.4.14) to eliminate $e(n)$ by expressing it in terms of $\tilde{\mathbf{c}}(n-1)$ and $e_o(n)$. The result is

$$\tilde{\mathbf{c}}(n) = [\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]\tilde{\mathbf{c}}(n-1) + 2\mu\mathbf{x}(n)e_o^*(n) \tag{10.4.15}$$

which is a time-varying forced or nonhomogeneous stochastic difference equation. The irreducible error $e_o(n)$ accounts for measurement noise, modeling errors, unmodeled dynamics, quantization effects, and other disturbances. The presence of $e_o(n)$ prevents convergence because it forces $\tilde{\mathbf{c}}(n)$ to fluctuate around zero. Therefore, the important issue is the BIBO stability of the system (10.4.15). From (10.2.28), we see that $\|\tilde{\mathbf{c}}(n)\|$ is bounded in mean square if we can show that $E\{\tilde{\mathbf{c}}(n)\} \to \mathbf{0}$ as $n \to \infty$ and $\mathrm{var}\{\tilde{c}_k(n)\}$ is bounded for all $n$. To this end, we develop difference equations for the mean value $E\{\tilde{\mathbf{c}}(n)\}$ and the correlation matrix

$$\boldsymbol{\Phi}(n) \triangleq E\{\tilde{\mathbf{c}}(n)\tilde{\mathbf{c}}^H(n)\} \tag{10.4.16}$$

of the coefficient error vector $\tilde{\mathbf{c}}(n)$. As we shall see, the MSD and the EMSE can be expressed in terms of matrices $\boldsymbol{\Phi}(n)$ and $\mathbf{R}$. The time evolution of these quantities provides sufficient information to evaluate the stability and steady-state performance of the LMS algorithm.


Convergence of the mean coefficient vector

If we take the expectation of (10.4.15), we have

$$E\{\tilde{\mathbf{c}}(n)\} = E\{\tilde{\mathbf{c}}(n-1)\} - 2\mu E\{\mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\} \tag{10.4.17}$$

because $E\{\mathbf{x}(n)e_o^*(n)\} = \mathbf{0}$ owing to the orthogonality principle. The computation of the second term in (10.4.17) requires the correlation between the input signal and the coefficient error vector.

If we assume that $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$ are statistically independent, (10.4.17) simplifies to

$$E\{\tilde{\mathbf{c}}(n)\} = (\mathbf{I} - 2\mu\mathbf{R})E\{\tilde{\mathbf{c}}(n-1)\} \tag{10.4.18}$$

which has the same form as (10.3.11) for the SDA. Therefore, $\tilde{\mathbf{c}}(n)$ converges in the mean, that is, $\lim_{n\to\infty} E\{\tilde{\mathbf{c}}(n)\} = \mathbf{0}$, if the eigenvalues of the system matrix $(\mathbf{I} - 2\mu\mathbf{R})$ are less than 1 in magnitude. Hence, if $\mathbf{R}$ is positive definite and $\lambda_{\max}$ is its maximum eigenvalue, the condition

$$0 < 2\mu < \frac{2}{\lambda_{\max}} \tag{10.4.19}$$

ensures that the LMS algorithm converges in the mean [see the discussion following (10.2.27)].

Independence assumption. The independence assumption between $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$ was critical to the derivation of (10.4.18). To simplify the analysis, we make the following independence assumptions (Gardner 1984):

A1 The sequence of input data vectors $\mathbf{x}(n)$ is independently and identically distributed with zero mean and correlation matrix $\mathbf{R}$.
A2 The sequences $\mathbf{x}(n)$ and $e_o(n)$ are independent for all $n$.

From (10.4.15), we see that $\tilde{\mathbf{c}}(n-1)$ depends on $\tilde{\mathbf{c}}(0)$, $\{\mathbf{x}(k)\}_0^{n-1}$, and $\{e_o(k)\}_0^{n-1}$. Since the sequence $\mathbf{x}(n)$ is IID and the quantities $\mathbf{x}(n)$ and $e_o(n)$ are independent, we conclude that $\mathbf{x}(n)$, $e_o(n)$, and $\tilde{\mathbf{c}}(n-1)$ are mutually independent. This result will be used several times to simplify the analysis of the LMS algorithm.

The independence assumption A1, first introduced in Widrow et al. (1976) and in Mazo (1979), ignores the statistical dependence among successive input data vectors; however, it preserves sufficient statistical information about the adaptation process to lead to useful design guidelines. Clearly, for FIR filtering applications, the independence assumption is violated because two successive input data vectors $\mathbf{x}(n)$ and $\mathbf{x}(n+1)$ have $M - 1$ common elements (shift-invariance property).


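Evolution of the coefficient error correlation matrix

The MSD can be expressed in terms of the trace of the correlation matrix† $\boldsymbol{\Phi}(n)$, that is,

$$\mathcal{D}(n) = \mathrm{tr}[\boldsymbol{\Phi}(n)] \tag{10.4.20}$$

which can be easily seen by using (10.2.29) and the definition of trace. If we postmultiply both sides of (10.4.15) by their respective Hermitian transposes and take the mathematical expectation, we obtain

$$\begin{aligned} \boldsymbol{\Phi}(n) &= E\{\tilde{\mathbf{c}}(n)\tilde{\mathbf{c}}^H(n)\} \\ &= E\{[\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]\tilde{\mathbf{c}}(n-1)\tilde{\mathbf{c}}^H(n-1)[\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]^H\} \\ &\quad + 2\mu E\{[\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]\tilde{\mathbf{c}}(n-1)e_o(n)\mathbf{x}^H(n)\} \\ &\quad + 2\mu E\{\mathbf{x}(n)e_o^*(n)\tilde{\mathbf{c}}^H(n-1)[\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]^H\} \\ &\quad + 4\mu^2 E\{\mathbf{x}(n)e_o^*(n)e_o(n)\mathbf{x}^H(n)\} \end{aligned} \tag{10.4.21}$$

From the independence assumptions, $e_o(n)$ is independent of $\tilde{\mathbf{c}}(n-1)$ and $\mathbf{x}(n)$. Therefore, the second and third terms in (10.4.21) vanish, and the fourth term is equal to $4\mu^2 P_o\mathbf{R}$. If we expand the first term, we obtain

$$\boldsymbol{\Phi}(n) = \boldsymbol{\Phi}(n-1) - 2\mu[\mathbf{R}\boldsymbol{\Phi}(n-1) + \boldsymbol{\Phi}(n-1)\mathbf{R}] + 4\mu^2\mathbf{A} + 4\mu^2 P_o\mathbf{R} \tag{10.4.22}$$

where

$$\mathbf{A} \triangleq E\{\mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)\mathbf{x}^H(n)\} \tag{10.4.23}$$

and the terms $\mathbf{R}\boldsymbol{\Phi}(n-1)$ and $\boldsymbol{\Phi}(n-1)\mathbf{R}$ have been computed by using the mutual independence of $\mathbf{x}(n)$, $\tilde{\mathbf{c}}(n-1)$, and $e_o(n)$.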

The computation of matrix $\mathbf{A}$ can be simplified if we make additional assumptions about the statistical properties of $\mathbf{x}(n)$. As shown in Gardner (1984), development of a recursive relation for the elements of $\boldsymbol{\Phi}(n)$ using only the independence assumptions requires the products with and the inversion of an $M^2 \times M^2$ matrix, where $M$ is the size of $\mathbf{x}(n)$.

The evaluation of this term when $\mathbf{x}(n) \sim$ IID, an assumption that is more appropriate for data transmission applications, is discussed in Gardner (1984). The computation for $\mathbf{x}(n)$ being a spherically invariant random process (SIRP) is discussed in Rupp (1993). SIRP models, which include the Gaussian distribution as a special case, provide a good characterization of speech signals. However, independently of the assumption used, the basic conclusions remain the same.

†Note that when (10.4.19) holds, $\lim_{n\to\infty} E\{\tilde{\mathbf{c}}(n)\} = \mathbf{0}$, and therefore $\boldsymbol{\Phi}(n)$ provides asymptotically the covariance of $\tilde{\mathbf{c}}(n)$.

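Assuming that $\mathbf{x}(n)$ is normally distributed, that is, $\mathbf{x}(n) \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$, a significant amount of simplification can be obtained. Indeed, in this case we can use the moment factorization property for normal random variables to express fourth-order moments in terms of second-order moments (Papoulis 1991). As we showed in Section 3.2.3, if $z_1$, $z_2$, $z_3$, and $z_4$ are complex-valued, zero-mean, and jointly distributed normal random variables, then

$$E\{z_1 z_2^* z_3 z_4^*\} = E\{z_1 z_2^*\}E\{z_3 z_4^*\} + E\{z_1 z_4^*\}E\{z_2^* z_3\} \tag{10.4.24}$$

or if they are real-valued, then

$$E\{z_1 z_2 z_3 z_4\} = E\{z_1 z_2\}E\{z_3 z_4\} + E\{z_1 z_3\}E\{z_2 z_4\} + E\{z_1 z_4\}E\{z_2 z_3\} \tag{10.4.25}$$

Using direct substitution of (10.4.24) or (10.4.25) in (10.4.23), we can show that

$$\mathbf{A} = \begin{cases} \mathbf{R}\boldsymbol{\Phi}(n-1)\mathbf{R} + \mathbf{R}\,\mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] & \text{complex case} \\ 2\mathbf{R}\boldsymbol{\Phi}(n-1)\mathbf{R} + \mathbf{R}\,\mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] & \text{real case} \end{cases} \tag{10.4.26}$$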


Finally, substituting (10.4.26) in (10.4.22), we obtain a difference equation for ®(n). This 
is summarized in the following property: 


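PROPERTY 10.4.1. Using the independence assumptions A1 and A2, and the normal distribution assumption of $\mathbf{x}(n)$, the correlation matrix of the coefficient error vector $\tilde{\mathbf{c}}(n)$ satisfies the difference equation

$$\begin{aligned} \boldsymbol{\Phi}(n) &= \boldsymbol{\Phi}(n-1) - 2\mu[\mathbf{R}\boldsymbol{\Phi}(n-1) + \boldsymbol{\Phi}(n-1)\mathbf{R}] \\ &\quad + 4\mu^2\mathbf{R}\boldsymbol{\Phi}(n-1)\mathbf{R} + 4\mu^2\mathbf{R}\,\mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] + 4\mu^2 P_o\mathbf{R} \end{aligned} \tag{10.4.27}$$

in the complex case and

$$\begin{aligned} \boldsymbol{\Phi}(n) &= \boldsymbol{\Phi}(n-1) - 2\mu[\mathbf{R}\boldsymbol{\Phi}(n-1) + \boldsymbol{\Phi}(n-1)\mathbf{R}] \\ &\quad + 8\mu^2\mathbf{R}\boldsymbol{\Phi}(n-1)\mathbf{R} + 4\mu^2\mathbf{R}\,\mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] + 4\mu^2 P_o\mathbf{R} \end{aligned} \tag{10.4.28}$$

in the real case. Both relations are matrix difference equations driven by the constant term $4\mu^2 P_o\mathbf{R}$.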


The presence of the term $4\mu^2 P_o\mathbf{R}$ in (10.4.27) or (10.4.28) implies that $\boldsymbol{\Phi}(n)$ will never become zero, and as a result the coefficients of the LMS adaptive filter will always fluctuate about their optimum settings, which prevents convergence. It has been shown (Bucklew et al. 1993) that asymptotically $\tilde{\mathbf{c}}(n)$ follows a zero-mean normal distribution. The amount of fluctuation is measured by matrix $\boldsymbol{\Phi}(n)$. In contrast, the absence of a driving term in (10.4.18) allows the convergence of $E\{\mathbf{c}(n)\}$ to the optimum vector $\mathbf{c}_o$.

Since there are two distinct forms for the difference equation of $\boldsymbol{\Phi}(n)$, we will consider the real case (10.4.28) for further discussion. Similar analysis can be done for the complex case (10.4.27), which is undertaken in Problem 10.11. To further simplify the analysis, we transform $\boldsymbol{\Phi}(n)$ to the principal coordinate space of $\mathbf{R}$ using the spectral decomposition

$$\mathbf{Q}^T\mathbf{R}\mathbf{Q} = \boldsymbol{\Lambda}$$

by defining the matrix

$$\boldsymbol{\Theta}(n) \triangleq \mathbf{Q}^T\boldsymbol{\Phi}(n)\mathbf{Q} \tag{10.4.29}$$

which is symmetric and positive definite [when $\boldsymbol{\Phi}(n)$ is positive definite].


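If we pre- and postmultiply (10.4.28) by $\mathbf{Q}^T$ and $\mathbf{Q}$ and use $\mathbf{Q}^T\mathbf{Q} = \mathbf{Q}\mathbf{Q}^T = \mathbf{I}$, we obtain

$$\begin{aligned} \boldsymbol{\Theta}(n) &= \boldsymbol{\Theta}(n-1) - 2\mu[\boldsymbol{\Lambda}\boldsymbol{\Theta}(n-1) + \boldsymbol{\Theta}(n-1)\boldsymbol{\Lambda}] \\ &\quad + 8\mu^2\boldsymbol{\Lambda}\boldsymbol{\Theta}(n-1)\boldsymbol{\Lambda} + 4\mu^2\boldsymbol{\Lambda}\,\mathrm{tr}[\boldsymbol{\Lambda}\boldsymbol{\Theta}(n-1)] + 4\mu^2 P_o\boldsymbol{\Lambda} \end{aligned} \tag{10.4.30}$$

which is easier to work with because of the diagonal nature of $\boldsymbol{\Lambda}$. For any symmetric and positive definite matrix $\boldsymbol{\Theta}$, we have $|\theta_{ij}(n)|^2 \le \theta_{ii}(n)\theta_{jj}(n)$. Hence, the convergence of the diagonal elements ensures the convergence of the off-diagonal elements. This observation and (10.4.30) suggest that to analyze the LMS algorithm, we should extract from (10.4.30) the equations for the diagonal elements

$$\boldsymbol{\theta}(n) \triangleq [\theta_1(n)\;\; \theta_2(n)\;\; \cdots\;\; \theta_M(n)]^T \tag{10.4.31}$$

of $\boldsymbol{\Theta}(n)$ and form a difference equation for the vector $\boldsymbol{\theta}(n)$. Indeed, we can easily show that

$$\boldsymbol{\theta}(n) = \mathbf{B}\boldsymbol{\theta}(n-1) + 4\mu^2 P_o\boldsymbol{\lambda} \tag{10.4.32}$$

where

$$\mathbf{B} \triangleq \boldsymbol{\Lambda}(\boldsymbol{\rho}) + 4\mu^2\boldsymbol{\lambda}\boldsymbol{\lambda}^T \tag{10.4.33}$$

$$\boldsymbol{\lambda} \triangleq [\lambda_1\;\; \lambda_2\;\; \cdots\;\; \lambda_M]^T \tag{10.4.34}$$

$$\boldsymbol{\Lambda}(\boldsymbol{\rho}) \triangleq \mathrm{diag}\{\rho_1, \rho_2, \ldots, \rho_M\} \tag{10.4.35}$$

$$\rho_k = 1 - 4\mu\lambda_k + 8\mu^2\lambda_k^2 = (1 - 2\mu\lambda_k)^2 + 4\mu^2\lambda_k^2 > 0 \qquad 1 \le k \le M \tag{10.4.36}$$

and $\lambda_k$ are the eigenvalues of $\mathbf{R}$. The solution of the vector difference equation (10.4.32) is

$$\boldsymbol{\theta}(n) = \mathbf{B}^n\boldsymbol{\theta}(0) + 4\mu^2 P_o\sum_{j=0}^{n-1}\mathbf{B}^j\boldsymbol{\lambda} \tag{10.4.37}$$

and can be easily found by recursion.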


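The stability of the linear system (10.4.32) is determined by the eigenvalues of the symmetric matrix $\mathbf{B}$. Using (10.4.33) and (10.4.35), for an arbitrary vector $\mathbf{z}$, we obtain

$$\mathbf{z}^T\mathbf{B}\mathbf{z} = \mathbf{z}^T\boldsymbol{\Lambda}(\boldsymbol{\rho})\mathbf{z} + 4\mu^2(\boldsymbol{\lambda}^T\mathbf{z})^2 = \sum_{k=1}^{M}\rho_k z_k^2 + 4\mu^2(\boldsymbol{\lambda}^T\mathbf{z})^2 \tag{10.4.38}$$

where we have used (10.4.36). Hence (10.4.38), for $\mathbf{z} \neq \mathbf{0}$, implies that $\mathbf{z}^T\mathbf{B}\mathbf{z} > 0$, that is, the matrix $\mathbf{B}$ is positive definite. Since matrix $\mathbf{B}$ is symmetric and positive definite, its eigenvalues $\lambda_k(\mathbf{B})$ are real and positive. The system (10.4.37) will be BIBO stable if and only if

$$0 < \lambda_k(\mathbf{B}) < 1 \qquad 1 \le k \le M \tag{10.4.39}$$

To find the range of $\mu$ that ensures (10.4.39), we use the Gerschgorin circles theorem (Noble and Daniel 1988), which states that each eigenvalue of an $M \times M$ matrix $\mathbf{B}$ lies in at least one of the disks with center at the diagonal element $b_{kk}$ and radius equal to the sum of absolute values $|b_{kj}|$, $j \neq k$, of the remaining elements of the row. Since the elements of $\mathbf{B}$ are positive, we can easily see that

$$\lambda_k(\mathbf{B}) - b_{kk} \le \sum_{\substack{j=1 \\ j \neq k}}^{M} b_{kj} \qquad \text{or} \qquad \lambda_k(\mathbf{B}) \le \rho_k + 4\mu^2\lambda_k\sum_{j=1}^{M}\lambda_j$$

using (10.4.33). Hence using (10.4.36), we see the eigenvalues of $\mathbf{B}$ satisfy (10.4.39) if

$$1 - 4\mu\lambda_k + 8\mu^2\lambda_k^2 + 4\mu^2\lambda_k\,\mathrm{tr}\,\mathbf{R} < 1$$

or

$$-\mu\lambda_k + 2\mu^2\lambda_k^2 + \mu^2\lambda_k\,\mathrm{tr}\,\mathbf{R} < 0$$

which implies that $\mu > 0$ and

$$2\mu < \frac{1}{\lambda_k + \mathrm{tr}\,\mathbf{R}} < \frac{1}{\mathrm{tr}\,\mathbf{R}}$$

because $\lambda_k > 0$ for all $k$. In conclusion, if the adaptation step $\mu$ satisfies the condition

$$0 < 2\mu < \frac{1}{\mathrm{tr}\,\mathbf{R}} \tag{10.4.40}$$

then the system (10.4.37) is stable and therefore the sequence $\boldsymbol{\theta}(n)$ converges.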


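PROPERTY 10.4.2. When the stability condition (10.4.40) holds, the solution (10.4.37) of the difference equation (10.4.32) can be written as

$$\boldsymbol{\theta}(n) = \mathbf{B}^n[\boldsymbol{\theta}(0) - \boldsymbol{\theta}(\infty)] + \boldsymbol{\theta}(\infty) \tag{10.4.41}$$

where $\boldsymbol{\theta}(0)$ is the initial value and $\boldsymbol{\theta}(\infty)$ is the steady-state value of $\boldsymbol{\theta}(n)$.

Proof. Using the identity

$$\sum_{j=0}^{n-1}\mathbf{B}^j = (\mathbf{I} - \mathbf{B}^n)(\mathbf{I} - \mathbf{B})^{-1} = (\mathbf{I} - \mathbf{B})^{-1} - \mathbf{B}^n(\mathbf{I} - \mathbf{B})^{-1}$$

the solution (10.4.37) can be written as

$$\boldsymbol{\theta}(n) = \mathbf{B}^n[\boldsymbol{\theta}(0) - 4\mu^2 P_o(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda}] + 4\mu^2 P_o(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda} \tag{10.4.42}$$

When the eigenvalues of $\mathbf{B}$ are inside the unit circle, we have

$$\lim_{n\to\infty}\boldsymbol{\theta}(n) \triangleq \boldsymbol{\theta}(\infty) = 4\mu^2 P_o(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda} \tag{10.4.43}$$

because the first term converges to zero. Substituting (10.4.43) in (10.4.42), we obtain (10.4.41).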


Evolution of the mean square error 


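We next express the MSE as a function of $\boldsymbol{\lambda}$ and $\boldsymbol{\theta}$. Using (10.2.10) and (10.2.18), we have

$$e(n) = y(n) - \mathbf{c}^H(n-1)\mathbf{x}(n) = e_o(n) - \tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n) \tag{10.4.44}$$

where $e_o(n)$ is the optimum filtering error and $\tilde{\mathbf{c}}(n)$ is the coefficient error vector. The (a priori) MSE of the adaptive filter at time $n$ is

$$\begin{aligned} P(n) &= E\{|e(n)|^2\} \\ &= E\{|e_o(n)|^2\} - E\{\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)e_o^*(n)\} - E\{e_o(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\} \\ &\quad + E\{\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\} \end{aligned} \tag{10.4.45}$$

Since $\tilde{\mathbf{c}}(n)$ is a random vector, the evaluation of the MSE (10.4.45) requires the correlation between $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$. Using the independence assumptions A1 and A2, we see that the second and third terms in (10.4.45) become zero, as explained before, and the excess MSE is given by the last term

$$P_{\mathrm{ex}}(n) = E\{\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\} \tag{10.4.46}$$

If we define the quantities

$$\mathbf{A} \triangleq \tilde{\mathbf{c}}^H(n-1) \qquad \text{and} \qquad \mathbf{B} \triangleq \mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1) \tag{10.4.47}$$

and notice that $\mathbf{A}\mathbf{B} = \mathrm{tr}(\mathbf{A}\mathbf{B})$ (because $\mathbf{A}\mathbf{B}$ is a scalar) and $\mathrm{tr}(\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{A})$, we obtain

$$\begin{aligned} P_{\mathrm{ex}}(n) &= E\{\mathrm{tr}(\mathbf{A}\mathbf{B})\} = E\{\mathrm{tr}(\mathbf{B}\mathbf{A})\} = \mathrm{tr}(E\{\mathbf{B}\mathbf{A}\}) \\ &= \mathrm{tr}(E\{\mathbf{x}(n)\mathbf{x}^H(n)\}E\{\tilde{\mathbf{c}}(n-1)\tilde{\mathbf{c}}^H(n-1)\}) \end{aligned}$$

because expectation is a linear operation and $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$ have been assumed statistically independent. Therefore, the excess MSE can be expressed as

$$P_{\mathrm{ex}}(n) = \mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] \tag{10.4.48}$$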


where $\boldsymbol{\Phi}(n) = E\{\tilde{\mathbf{c}}(n)\tilde{\mathbf{c}}^H(n)\}$ is the correlation matrix of the coefficient error vector. This expression simplifies to

$$P_{\mathrm{ex}}(n) = M\sigma_x^2\sigma_c^2 \tag{10.4.49}$$

if $\mathbf{R} = \sigma_x^2\mathbf{I}$ and $\boldsymbol{\Phi}(n) = \sigma_c^2\mathbf{I}$.

If $\mathbf{R}$ and $\boldsymbol{\Phi}(n)$ are both positive definite, relation (10.4.48) shows that $P_{\mathrm{ex}}(n) > 0$, that is, the MSE attained by the adaptive filter is larger than the optimum MSE $P_o$ of the optimum filter (cost of adaptation).

Next we develop a difference equation for $P_{\mathrm{ex}}(n)$, using, for convenience, the principal coordinate system of the input correlation matrix $\mathbf{R}$. Since the trace of a matrix remains invariant under an orthogonal transformation, we have

$$P_{\mathrm{ex}}(n) = \mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n)] = \mathrm{tr}[\boldsymbol{\Lambda}\boldsymbol{\Theta}(n)] = \boldsymbol{\lambda}^T\boldsymbol{\theta}(n) \tag{10.4.50}$$

where the elements of $\boldsymbol{\lambda}$ are the eigenvalues of $\mathbf{R}$ and the elements of $\boldsymbol{\theta}(n)$ are the diagonal elements of $\boldsymbol{\Theta}(n)$.

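Since the most often observable and important quantity for the operation of an adaptive filter is the MSE, we use our previous results to determine the value of MSE as a function of $n$, that is, the learning curve of the LMS adaptive filter. To this end, we use the orthogonal decomposition $\mathbf{B} = \mathbf{Q}(\mathbf{B})\boldsymbol{\Lambda}(\mathbf{B})\mathbf{Q}^T(\mathbf{B})$ to express $\mathbf{B}^n$ as

$$\mathbf{B}^n = \mathbf{Q}(\mathbf{B})\boldsymbol{\Lambda}^n(\mathbf{B})\mathbf{Q}^T(\mathbf{B}) = \sum_{k=1}^{M}\lambda_k^n(\mathbf{B})\,\mathbf{q}_k(\mathbf{B})\mathbf{q}_k^T(\mathbf{B}) \tag{10.4.51}$$

where $\lambda_k(\mathbf{B})$ are the eigenvalues and $\mathbf{q}_k(\mathbf{B})$ are the eigenvectors of matrix $\mathbf{B}$. Substituting (10.4.41) and (10.4.51) into (10.4.50) and recalling that $P(n) = P_o + P_{\mathrm{ex}}(n)$, we obtain

$$P(n) = P_o + P_{\mathrm{tr}}(n) + P_{\mathrm{ex}}(\infty) \tag{10.4.52}$$

where $P_{\mathrm{ex}}(\infty)$ is termed the steady-state excess MSE and

$$P_{\mathrm{tr}}(n) \triangleq \sum_{k=1}^{M}\gamma_k(\mathbf{R}, \mathbf{B})\,\lambda_k^n(\mathbf{B}) \tag{10.4.53}$$

is termed the transient MSE because it dies out exponentially when $0 < \lambda_k(\mathbf{B}) < 1$, $1 \le k \le M$. The constants

$$\gamma_k(\mathbf{R}, \mathbf{B}) \triangleq \boldsymbol{\lambda}^T(\mathbf{R})\,\mathbf{q}_k(\mathbf{B})\mathbf{q}_k^T(\mathbf{B})[\boldsymbol{\theta}(0) - \boldsymbol{\theta}(\infty)] \tag{10.4.54}$$

are determined by the eigenvalues $\lambda_k(\mathbf{R})$ of matrix $\mathbf{R}$ and the eigenvectors $\mathbf{q}_k(\mathbf{B})$ of matrix $\mathbf{B}$. Since the minimum MSE $P_o$ is available, we need to determine the steady-state excess MSE $P_{\mathrm{ex}}(\infty)$.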


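PROPERTY 10.4.3. When the LMS adaptive algorithm converges, the steady-state excess MSE is given by

$$P_{\mathrm{ex}}(\infty) = P_o\,\frac{C(\mu)}{1 - C(\mu)} \tag{10.4.55}$$

where

$$C(\mu) \triangleq \sum_{k=1}^{M}\frac{\mu\lambda_k}{1 - 2\mu\lambda_k} \tag{10.4.56}$$

and $\lambda_k$ are the eigenvalues of the input correlation matrix.

Proof. Using (10.4.32) and (10.4.35), we obtain the difference equation

$$\theta_k(n) = \rho_k\theta_k(n-1) + 4\mu^2\lambda_k P_{\mathrm{ex}}(n-1) + 4\mu^2 P_o\lambda_k \tag{10.4.57}$$

When (10.4.40) holds, (10.4.57) attains the following steady-state form

$$\theta_k(\infty) = \rho_k\theta_k(\infty) + 4\mu^2\lambda_k P_{\mathrm{ex}}(\infty) + 4\mu^2 P_o\lambda_k$$

whose solution, in conjunction with (10.4.36), gives

$$\theta_k(\infty) = \mu\,\frac{P_o + P_{\mathrm{ex}}(\infty)}{1 - 2\mu\lambda_k}$$

and

$$P_{\mathrm{ex}}(\infty) = \sum_{k=1}^{M}\lambda_k\theta_k(\infty) = [P_o + P_{\mathrm{ex}}(\infty)]\sum_{k=1}^{M}\frac{\mu\lambda_k}{1 - 2\mu\lambda_k}$$

Solving the last equation for $P_{\mathrm{ex}}(\infty)$, we obtain (10.4.55) and (10.4.56).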
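Property 10.4.3 can be checked numerically by iterating the scalar recursions (10.4.57) until they settle and comparing $\boldsymbol{\lambda}^T\boldsymbol{\theta}$ with the closed form (10.4.55). The Python sketch below does this for the real case. It is an illustration under stated assumptions: the function names are hypothetical, the recursion starts from $\boldsymbol{\theta}(0) = \mathbf{0}$, and the eigenvalues and $P_o$ are borrowed from the large-spread row of Table 10.2 with an arbitrarily chosen step size $\mu = 0.05$.

```python
def excess_mse_closed_form(eigs, mu, P_o):
    """Steady-state excess MSE from (10.4.55) and (10.4.56)."""
    C = sum(mu * lam / (1.0 - 2.0 * mu * lam) for lam in eigs)
    return P_o * C / (1.0 - C)


def excess_mse_by_recursion(eigs, mu, P_o, num_iter=20000):
    """Iterate (10.4.57) from theta(0) = 0 and return lambda^T theta after num_iter steps."""
    rho = [1.0 - 4.0 * mu * lam + 8.0 * mu ** 2 * lam ** 2 for lam in eigs]
    theta = [0.0] * len(eigs)
    for _ in range(num_iter):
        # Excess MSE at the previous step, lambda^T theta(n-1).
        P_ex = sum(lam * th for lam, th in zip(eigs, theta))
        theta = [rho[k] * theta[k] + 4.0 * mu ** 2 * eigs[k] * (P_ex + P_o)
                 for k in range(len(eigs))]
    return sum(lam * th for lam, th in zip(eigs, theta))


eigs = [1.818, 0.182]        # eigenvalues of R (large-spread case)
mu, P_o = 0.05, 0.0322       # 2*mu = 0.1 is below 1/tr(R) = 0.5
closed = excess_mse_closed_form(eigs, mu, P_o)
iterated = excess_mse_by_recursion(eigs, mu, P_o)
print(closed, iterated)
```

The two numbers agree to within rounding error, which is what the fixed-point argument in the proof predicts.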


Solving (10.4.55) for $C(\mu)$ gives

$$C(\mu) = \frac{P_{\mathrm{ex}}(\infty)}{P_o + P_{\mathrm{ex}}(\infty)} \tag{10.4.58}$$

which implies that

$$0 < C(\mu) < 1 \tag{10.4.59}$$

because $P_o$ and $P_{\mathrm{ex}}(\infty)$ are positive quantities. It has been shown that (10.4.59) leads to the tighter bound $0 < 2\mu < 2/(3\,\mathrm{tr}\,\mathbf{R})$ for the adaptation step $\mu$ (Horowitz and Senne 1981; Feuer and Weinstein 1985). Therefore, convergence in the MSE imposes a stronger constraint on the step size $\mu$ than does (10.4.40), which ensures convergence in the mean.


10.4.3 Summary and Design Guidelines 


There are many theoretical and simulation analyses of the LMS adaptive algorithm under a 
variety of assumptions. In this book, we have focused on results that help us to understand 
its operation and performance and to develop design guidelines for its practical application. 
The operation and performance of the LMS adaptive filter are determined by its stability 
and the properties of its learning curve, which shows the evolution of the MSE as a function 
of time. The MSE produced by the LMS adaptive algorithm consists of three components 
[see (10.4.52)] 


$$P(n) = P_o + P_{\mathrm{tr}}(n) + P_{\mathrm{ex}}(\infty)$$

where $P_o$ is the optimum MSE, $P_{\mathrm{tr}}(n)$ is the transient MSE, and $P_{\mathrm{ex}}(\infty)$ is the steady-state excess MSE. This equation provides the basis for understanding and evaluating the operation of the LMS adaptive algorithm in a stationary SOE. For convenience, the LMS adaptive filtering algorithm is summarized in Table 10.3.


TABLE 10.3
Summary of the LMS algorithm.

Design parameters
    $\mathbf{x}(n)$ = input data vector at time $n$
    $y(n)$ = desired response at time $n$
    $\mathbf{c}(n)$ = filter coefficient vector at time $n$
    $M$ = number of coefficients
    $\mu$ = step-size parameter, $0 < \mu \ll 1 \big/ \sum_{k=1}^{M} E\{|x_k(n)|^2\}$

Initialization
    $\mathbf{c}(-1) = \mathbf{x}(-1) = \mathbf{0}$

Computation
    For $n = 0, 1, 2, \ldots$, compute
        $\hat{y}(n) = \mathbf{c}^H(n-1)\mathbf{x}(n)$
        $e(n) = y(n) - \hat{y}(n)$
        $\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu\mathbf{x}(n)e^*(n)$

Stability. The LMS adaptive filter converges in the mean-square sense, that is, the transient MSE dies out, if the adaptation step $\mu$ satisfies the condition

$$0 < 2\mu < \frac{K}{\mathrm{tr}\,\mathbf{R}} \tag{10.4.60}$$

where $\mathrm{tr}\,\mathbf{R}$ is the trace of the input correlation matrix and $K$ is a constant that depends weakly on the statistics of the input data vector. For example, when $\mathbf{x}(n) \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$, we proved that $K = 1$ or $2/3$. In addition, this condition ensures that on average the LMS adaptive filter converges to the optimum filter. We stress that in most practical applications, where the independence assumption does not hold, the step size $\mu$ should be much smaller than $K/\mathrm{tr}\,\mathbf{R}$. Therefore, the exact value of $K$ is not important in practice.


Rate of convergence. The transient MSE dies out exponentially without exhibiting any oscillations. This follows from (10.4.53) because when $\mu$ satisfies (10.4.40), the eigenvalues of matrix $\mathbf{B}$ are positive and less than 1. The settling time, that is, the time taken for the transients to die out, is proportional to the average time constant

$$\tau_{\mathrm{lms,av}} \simeq \frac{1}{\mu\lambda_{\mathrm{av}}} \tag{10.4.61}$$

where $\lambda_{\mathrm{av}} = \left(\sum_{k=1}^{M}\lambda_k\right)/M$ is the average eigenvalue of $\mathbf{R}$ (Widrow et al. 1976). The quantity $P_{\mathrm{tr}}^{(\mathrm{total})} = \sum_{n=0}^{\infty} P_{\mathrm{tr}}(n)$, which provides the total transient MSE, can be used as a measure for the speed of adaptation. When $\mu\lambda_k \ll 1$ (see Problem 10.12), we have

$$P_{\mathrm{tr}}^{(\mathrm{total})} = \sum_{n=0}^{\infty} P_{\mathrm{tr}}(n) \simeq \frac{1}{4\mu}\sum_{k=1}^{M}\Delta\theta_k(0) \tag{10.4.62}$$

where $\Delta\theta_k(0)$ is the initial distance of a coefficient from its optimum setting measured in principal coordinates. As is intuitively expected, the smaller the step size and the farther the initial coefficients are from their optimum settings, the more iterations it takes for the LMS algorithm to converge. Furthermore, from the discussion in Section 10.3, it follows that the LMS algorithm will converge faster if the contours of the error surface are circles, that is, when the input correlation matrix is $\mathbf{R} = \sigma_x^2\mathbf{I}$.


Steady-state excess MSE. The excess MSE after the adaptation has been completed (i.e., the steady-state value) is given by (10.4.55). When $\mu\lambda_k \ll 1$, we may approximate (10.4.55) as follows

$$P_{\mathrm{ex}}(\infty) \simeq P_o\,\frac{\mu\,\mathrm{tr}\,\mathbf{R}}{1 - \mu\,\mathrm{tr}\,\mathbf{R}}$$

which allows a much easier interpretation. Solving for $\mu\,\mathrm{tr}\,\mathbf{R}$, we obtain $\mu\,\mathrm{tr}\,\mathbf{R} \simeq P_{\mathrm{ex}}(\infty)/[P_{\mathrm{ex}}(\infty) + P_o]$, which implies that $0 < \mu\,\mathrm{tr}\,\mathbf{R} < 1$. Since $\mu\,\mathrm{tr}\,\mathbf{R} \ll 1$, we often use the approximation

$$P_{\mathrm{ex}}(\infty) \simeq \mu P_o\,\mathrm{tr}\,\mathbf{R} \tag{10.4.63}$$

which implies that $P_{\mathrm{ex}}(\infty) \ll P_o$, that is, for small values of the step size the excess MSE is much smaller than the optimum MSE. Note that the presence of the irreducible error $e_o(n)$ prevents perfect adaptation as $n \to \infty$ because $P_o > 0$.
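A quick numerical comparison shows how good the small-step approximation (10.4.63) is relative to the exact expression (10.4.55). The Python sketch below evaluates the ratio $P_{\mathrm{ex}}(\infty)/P_o$ both ways; the function names are hypothetical, and the eigenvalues are those of the small-spread case in Table 10.2 (so that $\mathrm{tr}\,\mathbf{R} = 2$), chosen only for illustration.

```python
def relative_excess_mse_exact(eigs, mu):
    """P_ex(inf) / P_o from (10.4.55) and (10.4.56)."""
    C = sum(mu * lam / (1.0 - 2.0 * mu * lam) for lam in eigs)
    return C / (1.0 - C)


def relative_excess_mse_approx(eigs, mu):
    """Small-step approximation P_ex(inf) / P_o ~ mu * tr(R) from (10.4.63)."""
    return mu * sum(eigs)


eigs = [1.1, 0.9]            # eigenvalues of R, tr(R) = 2
for mu in (0.001, 0.01, 0.05):
    exact = relative_excess_mse_exact(eigs, mu)
    approx = relative_excess_mse_approx(eigs, mu)
    print(mu, exact, approx)
```

For the smallest step size the two values differ by well under one percent, while for $\mu = 0.05$ the approximation underestimates the exact ratio noticeably, which is consistent with the requirement $\mu\lambda_k \ll 1$.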


Speed versus quality of adaptation. From the previous discussion we see that there is a tradeoff between rate of convergence (speed of adaptation) and steady-state excess MSE (quality of adaptation, or accuracy of the adaptive filter). The first requirement for an adaptive filter is stability, which is ensured by choosing $\mu$ to satisfy (10.4.60). Within this range, decreasing $\mu$ to reduce the misadjustment, according to (10.4.63), decreases the speed of convergence; see (10.4.62). Conversely, increasing $\mu$ to speed up convergence increases the misadjustment. This tradeoff between speed of convergence and misadjustment is a fundamental feature of the LMS algorithm.


FIR filters. In this case, the input is a stationary process $x(n)$ with a Toeplitz correlation
matrix $\mathbf{R}$. Therefore, we have

$$\operatorname{tr}\mathbf{R} = M r(0) = M E\{|x(n)|^2\} = M P_x \tag{10.4.64}$$

where $M P_x$ is called the tap input power. Substituting (10.4.40) into (10.4.64), we obtain

$$0 < 2\mu < \frac{1}{M P_x} = \frac{1}{\text{tap input power}} \tag{10.4.65}$$

which shows that the selection of the step size depends on the input power. Using (10.4.63)
and (10.4.64), we see that the misadjustment $\mathcal{M}$ is given by

$$\mathcal{M} = \frac{P_{\text{ex}}(\infty)}{P_o} \simeq \mu M P_x \tag{10.4.66}$$

which shows that for given $M$ and $P_x$ the value of misadjustment is proportional to $\mu$. We
emphasize that the misadjustment provides a measure of how close an LMS adaptive filter
is to the corresponding optimum filter.
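
As a rough numerical illustration of (10.4.64) to (10.4.66), the following Python sketch estimates the tap input power from a data record and evaluates the step-size bound and the predicted misadjustment. The function name `lms_design_numbers`, the binary test sequence, and the values of $M$ and $\mu$ are assumptions made for this illustration only.

```python
def lms_design_numbers(x, M, mu):
    """Return (tap input power, upper bound on 2*mu, predicted misadjustment)."""
    # Sample estimate of the input power Px = E{|x(n)|^2}
    Px = sum(abs(v) ** 2 for v in x) / len(x)
    tap_power = M * Px               # tr R = M Px, as in (10.4.64)
    bound_2mu = 1.0 / tap_power      # 0 < 2 mu < 1 / (M Px), as in (10.4.65)
    misadjustment = mu * tap_power   # approximately mu M Px, as in (10.4.66)
    return tap_power, bound_2mu, misadjustment


# Illustrative data: a unit-power binary sequence (an assumption, not from the text)
x = [1.0, -1.0] * 500
tap_power, bound_2mu, mis = lms_design_numbers(x, M=10, mu=0.005)
print(tap_power, bound_2mu, mis)
```

Halving $\mu$ in this sketch halves the predicted misadjustment, which is the proportionality stated after (10.4.66).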
The statistical properties of the SOE, that is, the correlation of the input signal and 
the cross-correlation between input and desired response signals, play a key role in the 
performance of the LMS adaptive filter. 


First, we should make sure that the relation between x(n) and y(n) can be accurately 
modeled by a linear FIR filter with M coefficients. Inadequacy of the FIR structure, 
output observation noise, or lack of correlation between x(n) and y(n) increases the 
magnitude of the irreducible error. If M is very large, we may want to use a pole-zero 
IIR filter (Shynk 1989; Treichler et al. 1987). If the relationship between x(n) and y(n) 
is nonlinear, we certainly need a nonlinear filtering structure (Mathews 1991). 

The LMS algorithm uses a “noisy” instantaneous estimate of the gradient vector. How- 
ever, when the correlation between input and desired response is weak, the algorithm 
should make more cautious steps (“wait and average”). Such algorithms update their co- 
efficients every L samples, using all samples between successive updatings to determine 
the gradient (gradient averaging). 

The eigenvalue structure of $\mathbf{R}$ as measured by its eigenvalue spread ($\lambda_{\max}/\lambda_{\min}$) or
equivalently by the spectral flatness measure (SFM) (see Section 4.1) has a strong effect
on the rate of convergence of the LMS algorithm. In general, the rate of convergence 
decreases as the eigenvalue spread increases, that is, as the contours of the cost function 
become more elliptical, or equivalently the input spectrum becomes more nonwhite. 


Normalized LMS algorithm. According to (10.4.60), the selection of $\mu$ in practical
applications is complicated because the power of the input signal either is unknown or
varies with time. This problem can be addressed by using the normalized LMS (NLMS)
algorithm [see (10.4.11)]

$$\mathbf{c}(n) = \mathbf{c}(n-1) + \frac{\tilde{\mu}}{E_M(n)} \, \mathbf{x}(n) e^*(n) \tag{10.4.67}$$

where $E_M(n) = \|\mathbf{x}(n)\|^2$ and $0 < \tilde{\mu} < 1$. It can be shown that the NLMS algorithm
converges in the mean square if $0 < \tilde{\mu} < 1$ (Rupp 1993; Slock 1993), which makes the
selection of the step size $\tilde{\mu}$ much easier than the selection of $\mu$ in the LMS algorithm.

For FIR filters, the quantity $E_M(n)$ provides an estimate of $M E\{|x(n)|^2\}$ and can be
computed recursively by using the sliding-window formula

$$E_M(n) = E_M(n-1) + |x(n)|^2 - |x(n-M)|^2 \tag{10.4.68}$$

where $E_M(-1) = 0$, or by using a first-order recursive filter estimator. In practice, to avoid
division by zero if $\mathbf{x}(n) = \mathbf{0}$, we set $E_M(n) = \delta + \|\mathbf{x}(n)\|^2$, where $\delta$ is a small positive constant.
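
The NLMS update (10.4.67) with the sliding-window energy (10.4.68) can be sketched in a few lines of Python. This is a minimal real-valued illustration, not a reference implementation: the function name `nlms`, the default values, and the three-tap test system `h` are assumptions, and $\delta$ is added at every step rather than only when the data vector is zero.

```python
import random


def nlms(x, y, M, mu_tilde=0.5, delta=1e-6):
    """Real-valued NLMS sketch; returns (final coefficients, a priori errors)."""
    c = [0.0] * M          # coefficient vector c(n-1)
    buf = [0.0] * M        # data vector [x(n), x(n-1), ..., x(n-M+1)]
    energy = 0.0           # sliding-window energy E_M(n), with E_M(-1) = 0
    errors = []
    for xn, yn in zip(x, y):
        oldest = buf[-1]                       # x(n-M), leaving the window
        buf = [xn] + buf[:-1]
        # (10.4.68); max() guards against small negative roundoff drift
        energy = max(energy + xn * xn - oldest * oldest, 0.0)
        e = yn - sum(ck * xk for ck, xk in zip(c, buf))   # a priori error
        step = mu_tilde / (delta + energy)                # normalized step size
        c = [ck + step * e * xk for ck, xk in zip(c, buf)]
        errors.append(e)
    return c, errors


# Identify a hypothetical 3-tap FIR system from noiseless data
random.seed(0)
h = [0.5, -0.3, 0.2]
x = [random.gauss(0.0, 1.0) for _ in range(2000)]
y = [sum(h[k] * x[n - k] for k in range(3) if n - k >= 0) for n in range(2000)]
c, e = nlms(x, y, M=3)
print(c)
```

Because the correction is divided by the current input energy, the same $\tilde{\mu}$ can be kept if the input is rescaled, which is the practical advantage noted above.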


Other approaches and analyses. The analysis of the LMS algorithm presented in this 
section is simple, clarifies its performance, and provides useful design guidelines. However, 
there are many other approaches, which are beyond the scope of this book, that differ in 
terms of complexity, accuracy, and objectives. Major efforts to remove the independence 
assumption and replace it with the more realistic statistically dependent input assumption 
are documented in Macchi (1995), Solo (1997), and Butterweck (1995) and the references 
therein. Convergence analysis of the LMS algorithm using the stochastic approximation 
approach and a deterministic approach using the method of ordinary differential equations 
are discussed in Solo and Kong (1995), Sethares (1993), and Benveniste et al. (1987). Other 
types of analyses deal with the determination of the probability densities and the probability 
of large excursions of the adaptive filter coefficients for various types of input signals (Rupp 
1995). The analysis of the convergence properties of the LMS algorithm and its variations 
is still an active area of research, and new results appear continuously. 


10.4.4 Applications of the LMS Algorithm 


We now discuss three practical applications in which the LMS algorithm has made a sig- 
nificant impact. In the first case, we consider the previously discussed linear prediction 
problem and compare the performance of the LMS algorithm with that of the SDA. Table 
10.4 provides a summary of the key differences between the SDA and the LMS algorithms. 
In the second case, we study echo cancelation in full-duplex data transmission, which em- 
ploys the LMS algorithm in its implementation. In the third case, we discuss the application 
of adaptive equalization, which is used to minimize intersymbol interference (ISI) in a
dispersive channel environment. 


TABLE 10.4
Comparison between the SDA and LMS algorithms.


SDA                                                    LMS
-----------------------------------------------------  -----------------------------------------------------
Deterministic algorithm:                               Stochastic algorithm:
  $\lim_{n\to\infty} \mathbf{c}(n) = \mathbf{c}_o$       $\lim_{n\to\infty} E\{\mathbf{c}(n)\} = \mathbf{c}_o$
If it converges, it terminates at $\mathbf{c}_o$       If it converges, it fluctuates about $\mathbf{c}_o$
                                                       The size of fluctuations is proportional to $\mu$
Noiseless gradient estimate                            Noisy gradient estimate
Deterministic steps                                    Random steps

We can only compare the ensemble average behavior of LMS with the SDA.


Linear prediction 


In Example 10.3.1, the AR(2) model given in (10.3.25) was considered, and the SDA 
was used to determine the corresponding linear predictor coefficients. We also analyzed the 
performance of the SDA. In the following example, we perform a similar acquisition of 
predictor coefficients using the LMS algorithm, and we study the effects of the eigenvalue 
spread of the input correlation matrix on the convergence of the LMS adaptive algorithm 
when it is used to update the coefficients. 


EXAMPLE 10.4.1. The second-order system in (10.3.25) is repeated here, which generates the
signal $x(n)$:

$$x(n) + a_1 x(n-1) + a_2 x(n-2) = w(n)$$

where $w(n) \sim \text{WGN}(0, \sigma_w^2)$ and where the coefficients are selected from Table 10.2 for two
different eigenvalue spreads. A Gaussian pseudorandom number generator was used to obtain
1000 realizations of $x(n)$ using each set of parameter values given in Table 10.2. These sample
realizations were used for statistical analysis.

The second-order LMS adaptive predictor with coefficients $\mathbf{c}(n) = [c_1(n) \; c_2(n)]^T$ is given
by [see (10.4.12)]

$$\begin{aligned}
e(n) &= x(n) - c_1(n-1)x(n-1) - c_2(n-1)x(n-2) \qquad n \ge 0 \\
c_1(n) &= c_1(n-1) + 2\mu e(n)x(n-1) \\
c_2(n) &= c_2(n-1) + 2\mu e(n)x(n-2)
\end{aligned}$$

where $\mu$ is the step-size parameter. The adaptive predictor was initialized by setting $x(-1) =
x(-2) = 0$ and $c_1(-1) = c_2(-1) = 0$. The above adaptive predictor was implemented with
$\mu = 0.04$, and the predictor coefficients as well as the MSE were recorded for each realization.
These quantities were averaged to study the behavior of the LMS algorithm. These calculations
were repeated for $\mu = 0.01$.
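
The experiment described above can be approximated with the following Python sketch. It is only an illustration under stated assumptions: the AR(2) parameters `a1`, `a2`, and `var_w` are illustrative values chosen here (Table 10.2 is not reproduced), the ensemble is smaller than the 1000 realizations used in the example, and the helper names are hypothetical.

```python
import random


def ar2(n_samples, a1, a2, var_w, rng):
    """Generate x(n) + a1 x(n-1) + a2 x(n-2) = w(n) with zero initial conditions."""
    x1 = x2 = 0.0
    sd = var_w ** 0.5
    out = []
    for _ in range(n_samples):
        xn = -a1 * x1 - a2 * x2 + rng.gauss(0.0, sd)
        out.append(xn)
        x2, x1 = x1, xn
    return out


def lms_predictor(x, mu):
    """Second-order LMS predictor; returns the trajectory [(c1(n), c2(n)), ...]."""
    c1 = c2 = 0.0        # c1(-1) = c2(-1) = 0
    x1 = x2 = 0.0        # x(-1) = x(-2) = 0
    traj = []
    for xn in x:
        e = xn - c1 * x1 - c2 * x2       # a priori prediction error e(n)
        c1 += 2.0 * mu * e * x1
        c2 += 2.0 * mu * e * x2
        traj.append((c1, c2))
        x2, x1 = x1, xn
    return traj


rng = random.Random(1)
a1, a2, var_w = -0.195, 0.95, 0.0965     # assumed illustrative values
N, trials, mu = 500, 200, 0.04
avg = [0.0, 0.0]
for _ in range(trials):
    c1, c2 = lms_predictor(ar2(N, a1, a2, var_w, rng), mu)[-1]
    avg[0] += c1 / trials
    avg[1] += c2 / trials
print(avg)   # the ensemble average is expected to lie near (-a1, -a2)
```

Keeping the whole trajectory from `lms_predictor`, rather than only its last element, gives the data needed for coefficient learning curves of the kind discussed next.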

In Figure 10.20 we show several plots obtained for $\mathcal{X}(\mathbf{R}) = 1.22$. In plot (a) we show the
ensemble averaged trajectory $\{\mathbf{c}(n)\}_{n=0}^{150}$ superimposed on the MSE contours. A trajectory of a
single realization is also shown to illustrate its randomness. In plot (b) the $\mathbf{c}(n)$ learning curve
for the averaged value as well as for one single realization is shown. In plot (c) the corresponding
learning curves for the MSE are depicted. Finally, in plot (d) we show the effect of step size $\mu$
on the MSE learning curve. Similar plots are shown in Figure 10.21 for $\mathcal{X}(\mathbf{R}) = 10$.

FIGURE 10.20
Performance curves for the LMS used in the linear prediction problem with step-size
parameter $\mu = 0.04$ and eigenvalue spread $\mathcal{X}(\mathbf{R}) = 1.22$.

FIGURE 10.21
Performance curves for the LMS used in the linear prediction problem with step-size
parameter $\mu = 0.04$ and eigenvalue spread $\mathcal{X}(\mathbf{R}) = 10$.


Several observations can be made from these plots: 


• The trajectories and the learning curves for a single realization are clearly random or “noisy,”
  while the averaging over the ensemble clearly has a smoothing effect.
• The averaged quantities (coefficients and the MSE) converge to the true values, and this
  convergence rate is in accordance with theory.
• The rate of convergence of the LMS algorithm depends on the step size $\mu$. The smaller the
  step size, the slower the rate.
• The rate of convergence also depends on the eigenvalue spread $\mathcal{X}(\mathbf{R})$. The larger the spread,
  the slower the rate. For $\mathcal{X}(\mathbf{R}) = 1.22$, the algorithm converges in about 150 steps while for
  $\mathcal{X}(\mathbf{R}) = 10$ it requires about 500 steps.


Clearly these observations compare well with the theory. 


Echo cancelation in full-duplex data transmission 


Figure 10.22 illustrates a system that achieves simultaneous data transmission in both 
directions (full-duplex) over two-wire circuits using the special two-wire to four-wire inter- 
faces (called hybrid couplers) that exist in any telephone set. Although the hybrid couplers 
are designed to provide perfect isolation between transmitters and receivers, this is not the 
case in practical systems. As a result, (1) one part of the transmitted signal leaks through 
the near-end hybrid to its own receiver (near-end echo), and (2) another part is reflected 
by the far-end hybrid and ends up at its own receiver (far-end echo). The combined echo 
signal, which can be 30 dB stronger than the signal received from the other end, increases 
the number of errors. We note that in contrast with acoustic echo cancelation, the delay of 
echoes in data transmission is immaterial.

FIGURE 10.22
Model of a full-duplex data transmission system that uses an echo canceler
in the modems.


The best way to address this problem is to form a replica of the echo and then subtract it 
from the incoming signal. We can model the echoes as the result of an “echo” path between 
the transmitter and the receiver. For baseband data transmission this echo path is basically 
linear and varies very slowly with time. Therefore, we can obtain a replica of the echo signal 
using an FIR LMS adaptive filter (echo canceler), as shown in Figure 10.22. The inclusion 
of the transmitter in the echo path, as long as it involves linear operations, simplifies the 
implementation and improves the speed of adaptation because the input is an IID binary 
data sequence of values $+1$ and $-1$ with equal probability (Verhoeckx et al. 1979).

Referring to Figure 10.23, if we assume that the echo path has an FIR impulse response,
the echo signal is given by

$$y(n) = \mathbf{c}_o^T \mathbf{x}(n) \tag{10.4.69}$$

where $\mathbf{c}_o = [c_o(0) \; c_o(1) \; \cdots \; c_o(M-1)]^T$.
If $g(n)$ is the impulse response of the transmission path from the far-end transmitter to the
near-end receiver, the received signal is given by

$$s_r(n) = y(n) + z(n) + v(n) \triangleq y(n) + u(n) \tag{10.4.70}$$

$$z(n) = \sum_{k=0}^{\infty} g(k) s(n-k)$$

where $s(n)$ is the transmitted data signal and $v(n) \sim \text{WGN}(0, \sigma_v^2)$ is additive noise. The signal
$u(n) = z(n) + v(n)$ represents the “uncancelable” signal because it cannot be removed
by the canceler.


FIGURE 10.23
Block diagram of a system for investigating the performance of adaptive echo canceler.

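The LMS adaptive echo canceler is given by

$$\hat{y}(n) = \mathbf{c}^T(n-1)\mathbf{x}(n) \tag{10.4.71}$$
$$e(n) = s_r(n) - \hat{y}(n) \tag{10.4.72}$$
$$\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu e(n)\mathbf{x}(n) \tag{10.4.73}$$

where $s_r(n) = y(n) + u(n)$ is the received signal of (10.4.70) and $\mu$ is the adaptation step size.
The adaptive filter takes advantage of the fact that $x(n)$
is correlated with $y(n)$ but uncorrelated with $s(n)$ and $v(n)$.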
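The residual (uncanceled) echo is

$$e_r(n) \triangleq y(n) - \hat{y}(n) = [\mathbf{c}_o - \mathbf{c}(n-1)]^T \mathbf{x}(n) \triangleq -\tilde{\mathbf{c}}^T(n-1)\mathbf{x}(n) \tag{10.4.74}$$

and if we assume that $\tilde{\mathbf{c}}(n-1)$ and $\mathbf{x}(n)$ are independent, then

$$P_r(n) = E\{e_r^2(n)\} = E\{\tilde{\mathbf{c}}^T(n-1)\tilde{\mathbf{c}}(n-1)\}$$

because $\mathbf{R} = E\{\mathbf{x}(n)\mathbf{x}^T(n)\} = \mathbf{I}$. Using (10.4.69), (10.4.71), and (10.4.72), we can easily
show that

$$\tilde{\mathbf{c}}(n) = \tilde{\mathbf{c}}(n-1) - 2\mu\mathbf{x}(n)\mathbf{x}^T(n)\tilde{\mathbf{c}}(n-1) + 2\mu\mathbf{x}(n)u(n) \tag{10.4.75}$$

If we premultiply (10.4.75) by its transpose and take the mathematical expectation, we
obtain

$$P_r(n+1) = (1 - 4\mu + 4\mu^2 M) P_r(n) + 4\mu^2 M \sigma_u^2 \tag{10.4.76}$$

using the independence assumption and the relation $\mathbf{x}^T(n)\mathbf{x}(n) = M$. The solution of
(10.4.76), in terms of the residual echo ratio $P_r(n)/\sigma_u^2$, is

$$\frac{P_r(n)}{\sigma_u^2} = (1 - 4\mu + 4\mu^2 M)^n \left[ \frac{P_r(0)}{\sigma_u^2} - \frac{\mu M}{1 - \mu M} \right] + \frac{\mu M}{1 - \mu M} \tag{10.4.77}$$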


and describes completely the operation of the LMS adaptive echo canceler. Indeed, we draw 
the following conclusions: 


1. The algorithm converges if

$$|1 - 4\mu + 4\mu^2 M| < 1 \quad \text{or} \quad 0 < \mu < \frac{1}{M} \tag{10.4.78}$$

   which agrees with (10.4.40) because $\operatorname{tr}\mathbf{R} = M$.
2. After convergence we have

$$P_r(\infty) = \frac{\mu M}{1 - \mu M}\,\sigma_u^2 \simeq \mu M \sigma_u^2 \tag{10.4.79}$$

   which again is in agreement with (10.4.63).
3. If $P_r(n)/\sigma_u^2 \gg \mu M/(1 - \mu M)$, we have

$$\frac{P_r(n)}{P_r(0)} \simeq (1 - 4\mu + 4\mu^2 M)^n \tag{10.4.80}$$

   which can be used to find out how many iterations are required for a given echo reduction.
   For example, we can easily show that to achieve a 20-dB echo reduction requires
   $n_{20} \simeq 1.15/\mu$ iterations.
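
The three conclusions above can be checked numerically. The following Python sketch iterates the recursion (10.4.76), compares it with the closed-form solution (10.4.77), and evaluates the number of iterations for a 20-dB reduction from (10.4.80). The function names and the values of `mu`, `M`, and the initial ratio are assumptions chosen for illustration.

```python
import math


def residual_echo_ratio(n, mu, M, ratio0):
    """Closed-form solution (10.4.77) for P_r(n) / sigma_u^2."""
    rho = 1.0 - 4.0 * mu + 4.0 * mu * mu * M
    steady = mu * M / (1.0 - mu * M)
    return rho ** n * (ratio0 - steady) + steady


def iterate_recursion(n, mu, M, ratio0):
    """Iterate (10.4.76) directly, with both sides divided by sigma_u^2."""
    rho = 1.0 - 4.0 * mu + 4.0 * mu * mu * M
    p = ratio0
    for _ in range(n):
        p = rho * p + 4.0 * mu * mu * M
    return p


mu, M, ratio0 = 0.001, 20, 1000.0      # illustrative values (assumptions)
closed = residual_echo_ratio(500, mu, M, ratio0)
stepped = iterate_recursion(500, mu, M, ratio0)

# Iterations for a factor-of-100 (20 dB) reduction according to (10.4.80)
rho = 1.0 - 4.0 * mu + 4.0 * mu * mu * M
n20 = math.log(100.0) / -math.log(rho)
print(closed, stepped, n20, 1.15 / mu)
```

For small $\mu$ the factor $4\mu^2 M$ is negligible next to $4\mu$, so $n_{20} \approx \ln(100)/(4\mu) \approx 1.15/\mu$, which is the rule of thumb quoted in item 3.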


From the previous discussion, it should be clear that the step size $\mu$ plays a crucial
role in the performance of the adaptive echo canceler because it determines both the rate of 
convergence and the minimum residual echo cancelation that can be attained. Furthermore, 
we clearly see the tradeoff between fast adaptation and residual echo power. 


EXAMPLE 10.4.2. Consider the system shown in Figure 10.23 for investigating the performance 
of the LMS algorithm in adaptive echo cancelation and to verify the above conclusions. The 
data generators A (in modem A) and B (in modem B) output symbols +1 or —1 with equal 
probability (i.e., Bernoulli sequence). The FIR filter following data generator A models the echo 
path, which is assumed to be 


coln) = —3(5)" + 3G)" On <M-1 


where M = 20 is the total length of the echo path. The filter following data generator B models 
the transmission path between the far-end transmitter and the near-end receiver, which we will 
assume to be 


gin) = $(3)" n=0 


The noise generator is a white Gaussian source with $\sigma_v^2 = 1$ and models the transmission noise.
Using the equations $\sigma_y^2 = \sum_{k=0}^{M-1} c_o^2(k)$ and $\sigma_u^2 = \sum_{k=0}^{\infty} g^2(k) + \sigma_v^2$, we scale $u(n)$ so that
$10 \log(\sigma_y^2/\sigma_u^2) = 30$ dB. The adaptive echo canceler employs the LMS algorithm with $\mathbf{c}(0) = \mathbf{0}$.
We perform Monte Carlo simulations on this system. Figure 10.24 shows the residual echo ratio
$P_r(n)/\sigma_u^2$ evaluated by ensemble averaging over 200 independent trials of the experiment,
for two different step sizes in the LMS algorithm [which satisfy (10.4.78)], superimposed on
the corresponding theoretical curves computed by using (10.4.79) and (10.4.80). Clearly, the
simulations support the theoretical results quite accurately. More detailed discussions of adaptive
echo cancelation techniques for both baseband and passband data transmission systems can be
found in Gitlin et al. (1992) and in Ling (1993a).


FIGURE 10.24
Performance analysis of the LMS algorithm in the adaptive echo
cancelation that clearly shows the tradeoff between rate of
convergence and residual echo power.


Adaptive equalization 


In Section 6.8, we discussed the theory and implementation of channel equalization in
data transmission systems. When data are transmitted below 2400 bits/s, the ISI is relatively
small and does not pose a problem in the operation of a modem. However, for high-speed
communication over 2400 bits/s, an equalizer is needed in the modem to compensate for the
channel distortion. Since channel characteristics are generally unknown and time-varying,
an adaptive algorithm is required, which leads to adaptive equalization. Figure 10.25 describes
an application of adaptive filtering to adaptive channel equalization. Initially, coefficients of 
the equalizer are adjusted, by means of the LMS algorithm, by transmitting a known training 
sequence of short duration. After this short training period, the actual data sequence {y(n)} 
is transmitted. The slow variation in channel characteristics is then continuously tracked by 
adjusting coefficients of the equalizer, using the decisions in place of the known training 
sequence. This approach works well when decision errors are infrequent. 


FIGURE 10.25
Model of an adaptive equalizer in a data transmission system.


EXAMPLE 10.4.3. Figure 10.26 shows the block diagram of the system used in the experimental
investigation of the performance of the LMS algorithm used in the adaptive equalizer. The data
source generates a Bernoulli sequence $\{y(n)\}$ with symbols $+1$ and $-1$ having zero mean and unit
variance. The channel following the source is modeled by the raised cosine impulse response

$$h(n) = \begin{cases} 0.5\left\{1 + \cos\left[\dfrac{2\pi}{W}(n-2)\right]\right\} & n = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases} \tag{10.4.81}$$

where parameter $W$ is used to control the amount of channel distortion. The amount of channel
distortion increases with $W$. The random noise generator outputs white Gaussian sequence $v(n)$
which models the noise in the channel. The equalizer input is

$$x(n) = \sum_{k=1}^{3} h(k) y(n-k) + v(n) \tag{10.4.82}$$

Since $y(n)$ is an independent sequence and since $v(n)$ is uncorrelated with $y(n)$, the maximum
lag that produces nonzero correlation is 2. Thus the correlation of $x(n)$ is given by

$$\begin{aligned}
r_x(0) &= h^2(1) + h^2(2) + h^2(3) + \sigma_v^2 \\
r_x(1) &= h(1)h(2) + h(2)h(3) \\
r_x(2) &= h(1)h(3)
\end{aligned}$$

FIGURE 10.26
Block diagram of a system for investigating the performance of an adaptive
equalizer.


from which an $M \times M$ autocorrelation matrix $\mathbf{R}$ can be constructed for an equalizer of length $M$.
Clearly, parameter $W$ also controls the eigenvalues of $\mathbf{R}$ and hence the ratio $\mathcal{X}(\mathbf{R})$. The design
of an MSE equalizer has been discussed in Example 6.8.1. Here we study the performance of
the corresponding LMS adaptive equalizer.

The training signal $y(n)$ is delayed by an amount equal to the combined delay introduced
by the channel and the equalizer for the desired signal. The impulse response $h(n)$ in (10.4.81)
is symmetric with respect to $n = 2$, and assuming that the equalizer is a linear-phase FIR filter,
the total delay is equal to $\Delta = (M-1)/2 + 2$. The error signal $e(n) = y(n-\Delta) - \hat{y}(n)$ is
used along with $x(n)$ to implement the LMS algorithm in the adaptive equalizer with $\mathbf{c}(0) = \mathbf{0}$.
We performed Monte Carlo simulations using 100 realizations of random sequences with $M =
11$; $\Delta = 7$; $\sigma_v^2 = 0.001$; $W = 2.9$ and $W = 3.5$; and $\mu = 0.01$, 0.04, and 0.08. The results are
shown in Figures 10.27 and 10.28.
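
A single realization of this experiment can be sketched in Python as follows. The sketch assumes real-valued signals and the update $\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu e(n)\mathbf{x}(n)$; the function names, the random seed, the run length, and the choice $\mu = 0.01$ are assumptions made here for illustration, not the code behind Figures 10.27 and 10.28.

```python
import math
import random


def channel(W):
    """Raised-cosine response (10.4.81); list index 0 is unused padding."""
    return [0.0] + [0.5 * (1.0 + math.cos(2.0 * math.pi * (n - 2) / W))
                    for n in (1, 2, 3)]


def lms_equalizer(W, mu, M=11, delay=7, var_v=0.001, n_iter=500, rng=None):
    """One realization of the LMS equalizer; returns the squared errors e^2(n)."""
    rng = rng or random.Random(0)
    h = channel(W)
    y = [rng.choice((-1.0, 1.0)) for _ in range(n_iter)]   # Bernoulli data
    sd = math.sqrt(var_v)
    c = [0.0] * M
    buf = [0.0] * M              # [x(n), x(n-1), ..., x(n-M+1)]
    sq_err = []
    for n in range(n_iter):
        # Channel output plus noise, as in (10.4.82)
        xn = sum(h[k] * y[n - k] for k in (1, 2, 3) if n - k >= 0)
        xn += rng.gauss(0.0, sd)
        buf = [xn] + buf[:-1]
        y_hat = sum(ck * xk for ck, xk in zip(c, buf))
        desired = y[n - delay] if n >= delay else 0.0      # delayed training signal
        e = desired - y_hat
        c = [ck + 2.0 * mu * e * xk for ck, xk in zip(c, buf)]
        sq_err.append(e * e)
    return sq_err


sq_err = lms_equalizer(W=2.9, mu=0.01, n_iter=2000)
early = sum(sq_err[7:107]) / 100      # average over the first 100 trained steps
late = sum(sq_err[-200:]) / 200       # average near the end of the run
print(early, late)
```

Averaging `sq_err` over many independent runs, instead of over time as done here, would give an estimate of the MSE learning curve.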


Effect of eigenvalue spread. Performance plots of the LMS algorithm for $W = 2.9$ and
$W = 3.5$ are shown in Figure 10.27. In plot (a) we depict MSE learning curves from which
we observe that the convergence rate of the MSE decreases with $W$ [or equivalently with
increase in $\mathcal{X}(\mathbf{R})$], which is to be expected. The steady-state error, on the other hand, increases
with $W$. In plots (b) and (c) we show the ensemble averaged equalizer coefficients. Clearly, the
responses are symmetric with respect to $n = 5$ as assumed. Also equalizer coefficients converge
to different inverses due to changes in the channel characteristics.


Effect of step size $\mu$. In Figure 10.28 we show the MSE learning curves obtained for $W = 2.9$
and with three different step-size parameter values of 0.01, 0.04, and 0.08. It indicates that $\mu$
affects the rate of convergence as well as the steady-state value. For $\mu = 0.08$, the algorithm
converges in about 100 iterations but has higher steady-state value than the case for $\mu = 0.04$,
which requires about 275 iterations for convergence. For $\mu = 0.01$ more than 500 iterations
are needed. Finally, Figure 10.29 shows sample realizations of the transmitted, received, and
equalized sequences using the discussed LMS equalizer.


10.4.5 Some Practical Considerations 


The LMS is the most widely known and used adaptive algorithm because of its simplicity 
and robustness to disturbances and model errors. We next discuss some issues related to its 
robustness, finite-word-length effects, and implementation. 


Robustness 


If we assume the model in Figure 10.19, an adaptive filter is said to be robust if the
effect of the disturbances $\{\mathbf{c}(-1), e_o(n)\}$ on the resulting estimation errors $\{\tilde{\mathbf{c}}(n), e(n)\}$ (or
$\{\tilde{\mathbf{c}}(n), \varepsilon(n)\}$), as measured by their energy, is small (Sayed and Rupp 1998). Basically a
robust adaptive filter should be insensitive to the initial conditions $\mathbf{c}(-1)$ and the optimum
residual error $e_o(n)$, which acts as measurement noise. These inputs are collectively called
disturbances. In practice, $e_o(n)$ accounts not only for measurement noise but also for model
mismatching, quantization errors, and other inaccuracies.


FIGURE 10.27
Performance analysis curves of the LMS algorithm in the adaptive equalizer:
$\mu = 0.04$.

FIGURE 10.28
MSE learning curves of the LMS algorithm in the adaptive equalizer: $W = 2.9$.

FIGURE 10.29
Sample realizations of the transmitted, received, and equalized sequences using an
FIR LMS equalizer.


If we define the energies of the disturbances and the estimation errors by 


1 . n 
Eaia(n) = SIDI + leat? (104.83) 
j=0 
and Bais 5 leon? Es = Kn) (104.84) 


it can be shown that the coefficient vectors determined by the LMS algorithm satisfy the 
condition 


$$E_{\text{error}}(n) \le E_{\text{dist}}(n) \tag{10.4.85}$$


assuming that $0 < 2\mu < 1/\|\mathbf{x}(n)\|^2$ (Sayed and Kailath 1994; Sayed and Rupp 1996).
Equation (10.4.85) shows that the energy of the residuals is always upper-bounded by the
energy of the disturbances, which explains the robust behavior of the LMS algorithm.

Furthermore, it can be shown that the LMS algorithm minimizes the maximum possible
difference between these two energies, over all disturbances with finite energy, and is
optimum according to the $H^\infty$ (or minimax) criterion (Sayed and Rupp 1998; Hassibi et
al. 1996).


Finite-precision effects


When we design an LMS adaptive filter for a stationary SOE, we choose the step size
$\mu$ to provide the desired balance between speed of convergence and misadjustment. If we
are not concerned about fast convergence, we can reduce $\mu$ so much as to obtain practically
insignificant misadjustment. However, in a digital implementation, the adaptation of the
LMS algorithm stops (stalls) when the correction term becomes smaller in magnitude than
one-half of the least significant bit (LSB), that is,

$$|2\mu e^*(n) x(n-k)| < \frac{\text{LSB}}{2} \tag{10.4.86}$$

Therefore, a decrease in $\mu$ may result in a performance degradation, unless we increase the
number of bits (i.e., the precision) of the filter coefficients. If $X_{\text{rms}}$ is the root mean square
(rms) amplitude of the input signal, to a good approximation we have

$$|e(n)| < \frac{\text{LSB}}{4\mu X_{\text{rms}}} \triangleq \text{DRE} \tag{10.4.87}$$

where DRE is known as the digital residual error (Gitlin et al. 1973). We note that for a
given number of bits the DRE increases as we decrease the step size $\mu$.

The roundoff numerical errors contribute to the steady-state EMSE a term that is
inversely proportional to $\mu$, whereas the quantization of the input data and the filter output
contributes a second term that is independent of the step size (Caraiscos and Liu 1984).
Hence, in practice the step size of the LMS algorithm cannot be decreased below the level
where the degradation effects of quantization and finite-precision arithmetic become
significant. Also, the finite-precision effects become more pronounced as the ill conditioning
of the input increases (Alexander 1987).

When one or more eigenvalues of the input correlation matrix are zero, the corresponding
adaptation modes either do not converge or may result in overflow due to nonlinear
quantization effects (Gitlin et al. 1982). These effects can be prevented by using a
technique known as leakage. The leaky LMS algorithm is given by

$$\mathbf{c}(n) = (1 - \gamma\mu)\mathbf{c}(n-1) + \mu e^*(n)\mathbf{x}(n) \tag{10.4.88}$$

where $\gamma$ is the leakage coefficient. Since $\mu$ and $\gamma$ are very small positive constants, $1 - \gamma\mu$
is slightly less than 1. The updating (10.4.88) is obtained by minimizing the cost function

$$P(n) = |e(n)|^2 + \gamma\|\mathbf{c}(n)\|^2 \tag{10.4.89}$$



which includes a penalty term proportional to the size of the coefficient vector. The price of 
leakage is an increase in computational complexity and some bias in the obtained estimates 
(see Problem 10.17). More details and practical applications of the leaky LMS algorithm 
to adaptive equalization are discussed in Gitlin et al. (1992, 1982). 
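
The effect of leakage on an unexcited adaptation mode, and the bias it introduces, can be seen in a small Python sketch of the update (10.4.88). The setup is an assumption made for illustration: the two components of the data vector are identical, so the input correlation matrix has a zero eigenvalue, and the names `leaky_lms_step` and `run` as well as the values of `mu` and `gamma` are hypothetical.

```python
def leaky_lms_step(c, x_vec, y, mu, gamma):
    """One real-valued leaky LMS update following (10.4.88)."""
    e = y - sum(ck * xk for ck, xk in zip(c, x_vec))     # a priori error
    leak = 1.0 - gamma * mu
    return [leak * ck + mu * e * xk for ck, xk in zip(c, x_vec)], e


def run(gamma, n_iter=2000, mu=0.05):
    """Adapt two coefficients driven by identical inputs (singular R)."""
    c = [2.0, -2.0]                  # initial error along the unexcited mode
    for n in range(n_iter):
        s = 1.0 if n % 2 == 0 else -1.0
        c, _ = leaky_lms_step(c, [s, s], s, mu, gamma)   # desired response y = s
    return c


c_plain = run(gamma=0.0)     # no leakage: c1 - c2 keeps its initial value
c_leaky = run(gamma=0.1)     # leakage: c1 - c2 decays, c1 + c2 is slightly biased
print(c_plain, c_leaky)
```

Without leakage the difference $c_1 - c_2$ is never corrected, whereas with leakage it decays as $(1 - \gamma\mu)^n$; the price is that $c_1 + c_2$ settles at $2/(2 + \gamma)$ instead of 1 in this particular setup.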

We can simplify the hardware implementation of LMS adaptive filters by using non- 
linearities to avoid the multiplications involved in the updating of the filter coefficients. 
These simplified LMS algorithms update the filter coefficients by using quantized correction
terms such as $\mu\,\text{sign}\{e(n)\}x(n-k)$, $\mu e(n)\,\text{sign}\{x(n-k)\}$, or $\mu\,\text{sign}\{e(n)x(n-k)\}$; and
their performance is degraded by the lower precision. Various signum-based LMS adaptive
algorithms are discussed in Claasen and Mecklenbrauker (1981), Duttweiler (1982), and 
Treichler et al. (1987). 
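
The three quantized correction terms can be compared with a short real-valued Python sketch. The variant labels "sign-error", "sign-data", and "sign-sign" are common names used here for convenience and are not taken from the text; the two-tap test system `h`, the binary input, and the step size are likewise assumptions.

```python
import random


def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def signed_lms_step(c, x_vec, y, mu, variant="sign-error"):
    """One update with a quantized correction term; returns (new c, error)."""
    e = y - sum(ck * xk for ck, xk in zip(c, x_vec))
    if variant == "sign-error":          # mu * sign{e(n)} * x(n-k)
        corr = [mu * sign(e) * xk for xk in x_vec]
    elif variant == "sign-data":         # mu * e(n) * sign{x(n-k)}
        corr = [mu * e * sign(xk) for xk in x_vec]
    elif variant == "sign-sign":         # mu * sign{e(n) x(n-k)}
        corr = [mu * sign(e * xk) for xk in x_vec]
    else:
        raise ValueError(variant)
    return [ck + dk for ck, dk in zip(c, corr)], e


rng = random.Random(2)
h = [0.5, -0.25]                          # hypothetical system to identify
results = {}
for variant in ("sign-error", "sign-data", "sign-sign"):
    c, prev = [0.0, 0.0], 0.0
    for _ in range(5000):
        xn = rng.choice((-1.0, 1.0))
        x_vec = [xn, prev]
        y = h[0] * xn + h[1] * prev
        c, _ = signed_lms_step(c, x_vec, y, mu=0.002, variant=variant)
        prev = xn
    results[variant] = c
print(results)
```

With a sign-quantized error the coefficients move in fixed increments of size $\mu$, so they keep dithering around the solution instead of settling exactly, which is one way the lower precision shows up.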


Transform-domain and block LMS algorithms 


The LMS algorithm attains its best rate of convergence when the input correlation 
matrix is diagonal with equal eigenvalues. In the case of FIR filters, this implies that the 
input signal is white noise. When the components of the input data vector are correlated, we 
can improve the convergence by using an isotropic decorrelating transformation, as shown in 
Figure 10.30. The transformation matrix can be obtained by using either the triangular or the 


FIGURE 10.30
Transform domain LMS adaptive filter structure.


orthogonal decomposition of the input correlation matrix as explained in Section 3.5. Since 
the innovations vector used by the LMS algorithm has uncorrelated components with unit 
variance, the error performance surface is a hypersphere, and the transform-domain LMS 
algorithm attains its best rate of convergence. In practice, when the input correlation matrix 
is unknown and possibly time-varying, we can only use suboptimum transforms such as the 
DFT, the discrete cosine transform (DCT), the discrete wavelet transform (DWT), or some 
other orthogonal transform. The performance of the obtained adaptive filter depends on the 
decorrelation properties of the transform, which in turn depends on the properties of the 
input correlation matrix. Another approach to overcome the problem of slow convergence 
for highly correlated inputs is found in the family of affine projection algorithms discussed in 
Ozeki and Umeda (1984), Rupp (1995), and Morgan and Kratzer (1996) and the references 
therein. 

FIGURE 10.31
Block adaptive filter structure.


In applications that require adaptive filters with a very large number of coefficients,
real-time implementation of the LMS algorithm becomes quite involved. For example,
acoustic echo cancelers with 8000 coefficients (500 ms sampled at 16 kHz) are typical for
teleconference applications (Gilloire et al. 1996). The complexity of such applications can
be reduced by using block adaptive filters (see Figure 10.31) that process one block of
data at a time in either the time or the frequency domain. The adaptive filter coefficients
are updated once per block and are kept fixed within the block. Such filters have good
numerical accuracy and can be easily pipelined and parallelized, and their complexity can
be reduced by computing the involved convolutions and correlations using FFT algorithms.
In some applications, such as acoustic echo cancelation, the block-length delay introduced
by these filters may create problems. A detailed treatment of block and frequency-domain
LMS algorithms is given in Shynk (1992), Gilloire et al. (1996), Haykin (1996), Jenkins
and Marshall (1998), and Treichler et al. (1987).
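
A time-domain block LMS filter can be sketched as follows. This is a minimal real-valued illustration, assuming that the correction accumulated over a block is averaged over the block length $L$ (other normalizations are possible); the function name `block_lms`, the test system `h`, and the parameter values are assumptions, and no FFT-based speedup is attempted.

```python
import random


def block_lms(x, y, M, L, mu):
    """Block LMS sketch: coefficients change once per block of L samples."""
    c = [0.0] * M
    buf = [0.0] * M                      # [x(n), x(n-1), ..., x(n-M+1)]
    errors = []
    for start in range(0, len(x) - L + 1, L):
        grad = [0.0] * M                 # correction accumulated over the block
        for n in range(start, start + L):
            buf = [x[n]] + buf[:-1]
            # c is held fixed while the block is processed
            e = y[n] - sum(ck * xk for ck, xk in zip(c, buf))
            errors.append(e)
            grad = [gk + e * xk for gk, xk in zip(grad, buf)]
        # One update per block, using the block-averaged correction
        c = [ck + 2.0 * mu * gk / L for ck, gk in zip(c, grad)]
    return c, errors


rng = random.Random(5)
h = [0.4, -0.2, 0.1]                     # hypothetical system to identify
x = [rng.gauss(0.0, 1.0) for _ in range(4000)]
y = [sum(h[k] * x[n - k] for k in range(3) if n - k >= 0) for n in range(4000)]
c, errors = block_lms(x, y, M=3, L=8, mu=0.05)
print(c)
```

Since the filter output for a whole block is computed with fixed coefficients, the inner loop is a plain convolution and correlation, which is what makes FFT-based implementations attractive for long filters.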

Another approach to reduce complexity and improve convergence is subband adaptive 
filtering, which splits the input signal and the desired response into smaller frequency 
bands (subbands), subsamples the resulting signals, processes each subband with different 
LMS filters, and finally interpolates and recombines the subbands to obtain the filter output 
(Shynk 1992; Gilloire and Vetterli 1992). The improved convergence results because the 
spectral dynamic range of each subband is smaller than that of the full band. However, 
the performance of subband adaptive filters is degraded by the cross-talk between adjacent 
subbands. 


10.5 RECURSIVE LEAST-SQUARES ADAPTIVE FILTERS 


In this section we use the method of LS to develop adaptive filters, we determine their 
rate of convergence and misadjustment, and we introduce the conventional recursive least- 
squares (CRLS) algorithm for their implementation. The CRLS algorithm does not impose 
any restrictions on the input data vector; therefore, it can be used for both array processing 
and FIR filtering applications. 


10.5.1 LS Adaptive Filters 


LS adaptive filters are designed so that the updating of their coefficients always attains the 
minimization of the total squared error from the time the filter initiated operation up to the 
current time. Therefore, the filter coefficients at time index n are chosen to minimize the 


cost function

$$E(n) = \sum_{j=0}^{n} \lambda^{n-j} |e(j)|^2 = \sum_{j=0}^{n} \lambda^{n-j} |y(j) - \mathbf{c}^H \mathbf{x}(j)|^2 \tag{10.5.1}$$

where $e(j)$ is the instantaneous error and the constant $\lambda$, $0 < \lambda \le 1$, is the forgetting
factor. Note that since the filter coefficients are held constant during the observation interval
$0 \le j \le n$, the a priori and a posteriori errors are identical. The coefficient vector obtained
by minimizing (10.5.1) is denoted by $\mathbf{c}(n)$ and provides the optimum LSE filter at time $n$.
When $\lambda = 1$, we say that the algorithm has growing memory because the values of the
filter coefficients are a function of all the past input values. The forgetting factor (see Figure
10.32) is used to ensure that data in the distant past are paid less attention (“forgotten”) in
order to provide the filter with tracking capability when it operates in a varying SOE (see
Section 10.8).
The filter coefficients that minimize the total squared error (10.5.1) are specified by the
normal equations

$$\mathbf{R}(n)\mathbf{c}(n) = \mathbf{d}(n) \tag{10.5.2}$$

where

$$\mathbf{R}(n) \triangleq \sum_{j=0}^{n} \lambda^{n-j} \mathbf{x}(j)\mathbf{x}^H(j) \tag{10.5.3}$$

and

$$\mathbf{d}(n) \triangleq \sum_{j=0}^{n} \lambda^{n-j} \mathbf{x}(j) y^*(j) \tag{10.5.4}$$

provide exponentially weighted estimates of the input correlation matrix and the
cross-correlation vector between input and desired response due to the presence of $\lambda^{n-j}$ in the
cost function (10.5.1). The minimum total squared error is

$$E_{\min}(n) = E_y(n) - \mathbf{d}^H(n)\mathbf{c}(n) \tag{10.5.5}$$

where

$$E_y(n) \triangleq \sum_{j=0}^{n} \lambda^{n-j} |y(j)|^2 \tag{10.5.6}$$

is the energy of the weighted desired response signal. These formulas have been derived in
Section 8.2.1.


FIGURE 10.32
Exponential weighting of observations at times n and n + 1. Older data are more
heavily discounted by the algorithm.

Suppose now that we wait for some n > M, where R(n) is usually nonsingular, compute R(n) and d(n), and then solve the normal equations (10.5.2) to determine the filter coefficients c(n). This time-consuming procedure has to be repeated with the arrival of each new pair of observations {x(n), y(n)}, that is, at times n + 1, n + 2, etc.

A first reduction in computational complexity can be obtained by noticing that (10.5.3) 
can be expressed as 


$$\mathbf{R}(n) = \lambda\,\mathbf{R}(n-1) + \mathbf{x}(n)\mathbf{x}^H(n) \tag{10.5.7}$$

which shows that the "new" correlation matrix R(n) can be updated by weighting the "old" correlation matrix R(n − 1) with the forgetting factor λ and then incorporating the "new information" x(n)x^H(n). Since the outer product x(n)x^H(n) is a matrix of rank 1, (10.5.7) provides a rank 1 modification of the correlation matrix. Similarly, using (10.5.4), we can show that

$$\mathbf{d}(n) = \lambda\,\mathbf{d}(n-1) + \mathbf{x}(n)\,y^*(n) \tag{10.5.8}$$

which provides a time update of the cross-correlation vector.

We next show that using these two updatings, we can determine the new coefficient vector c(n) from the old coefficient vector c(n − 1) and the new observation pair {x(n), y(n)} without solving the normal equations (10.5.2) from scratch.


A priori adaptive LS algorithm. If we solve (10.5.7) for R(n − 1) and (10.5.8) for d(n − 1) and use the normal equations (10.5.2), we have

$$[\mathbf{R}(n) - \mathbf{x}(n)\mathbf{x}^H(n)]\,\mathbf{c}(n-1) = \mathbf{d}(n) - \mathbf{x}(n)\,y^*(n)$$

or after some simple manipulations

$$\mathbf{R}(n)\,\mathbf{c}(n-1) + \mathbf{x}(n)\,e^*(n) = \mathbf{d}(n) \tag{10.5.9}$$

where

$$e(n) = y(n) - \mathbf{c}^H(n-1)\,\mathbf{x}(n) \tag{10.5.10}$$

is the a priori estimation error. If the matrix R(n) is invertible, by multiplying both sides of (10.5.9) by R⁻¹(n) and using (10.5.2), we obtain

$$\mathbf{c}(n-1) + \mathbf{R}^{-1}(n)\,\mathbf{x}(n)\,e^*(n) = \mathbf{R}^{-1}(n)\,\mathbf{d}(n) = \mathbf{c}(n) \tag{10.5.11}$$

If we define the adaptation gain vector g(n) by

$$\mathbf{R}(n)\,\mathbf{g}(n) \triangleq \mathbf{x}(n) \tag{10.5.12}$$

Equation (10.5.11) can be written as

$$\mathbf{c}(n) = \mathbf{c}(n-1) + \mathbf{g}(n)\,e^*(n) \tag{10.5.13}$$

which shows how to update the old coefficient vector c(n − 1) to obtain the current vector c(n).
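The a priori recursion (10.5.7), (10.5.10), (10.5.12), (10.5.13) can be sketched numerically as follows. This is only an illustrative sketch: the helper names `block_ls` and `a_priori_ls_step`, the synthetic data, and the choice to start the recursion from a block solution over the first five samples are assumptions made here, not part of the text. The gain is obtained by solving the linear system (10.5.12) directly at every step.

```python
import numpy as np

def block_ls(X, Y, lam):
    """Exponentially weighted R(n), d(n) of (10.5.3)-(10.5.4) and the block solution c(n)."""
    n = len(Y) - 1
    w = lam ** (n - np.arange(n + 1))        # weights lambda^(n-j)
    R = (X.T * w) @ X.conj()                 # sum_j w_j x(j) x^H(j)
    d = (X.T * w) @ Y.conj()                 # sum_j w_j x(j) y*(j)
    return R, d, np.linalg.solve(R, d)

def a_priori_ls_step(R, c, x, y, lam):
    """One a priori LS update; returns R(n), c(n), and the a priori error e(n)."""
    R = lam * R + np.outer(x, x.conj())      # rank-1 update (10.5.7)
    e = y - np.vdot(c, x)                    # (10.5.10); np.vdot conjugates c, giving c^H x
    g = np.linalg.solve(R, x)                # gain from R(n) g(n) = x(n), (10.5.12)
    c = c + g * np.conj(e)                   # coefficient update (10.5.13)
    return R, c, e

# Hypothetical test data: row n of X holds x(n)^T, and y(n) = c_true^H x(n) + noise.
rng = np.random.default_rng(0)
M, N, lam = 3, 40, 0.95
X = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
c_true = np.array([1.0 - 0.5j, 0.3j, -0.7])
Y = X @ c_true.conj() + 0.01 * rng.standard_normal(N)

n0 = 5                                       # assumed start-up block so that R is invertible
R, d, c = block_ls(X[:n0], Y[:n0], lam)
for n in range(n0, N):
    R, c, e = a_priori_ls_step(R, c, X[n], Y[n], lam)

_, _, c_block = block_ls(X, Y, lam)          # block solution over all the data
max_diff = np.max(np.abs(c - c_block))
```

Under these assumptions the recursively updated vector agrees with the block solution of the normal equations up to rounding error, which is the content of (10.5.11).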


EXAMPLE 10.5.1. It is instructive at this point to derive the LS adaptive filter with a single coefficient. Indeed, since for M = 1 the correlation matrix R(n) becomes the scalar E_x(n), we obtain

$$E_x(n) = \lambda\,E_x(n-1) + |x(n)|^2$$

$$e(n) = y(n) - c^*(n-1)\,x(n)$$

$$c(n) = c(n-1) + \frac{1}{E_x(n)}\,x(n)\,e^*(n)$$

which is like an LMS algorithm with time-varying gain μ(n) = 1/E_x(n). However, the present algorithm is optimum in the LS sense.
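A minimal sketch of the single-coefficient filter of Example 10.5.1 follows. The function name `scalar_ls_filter`, the small positive starting energy `Ex0` (used only to avoid a division by zero on the first sample), and the noise-free test signal are assumptions introduced for illustration.

```python
import numpy as np

def scalar_ls_filter(x, y, lam=0.99, Ex0=1e-6):
    """Single-coefficient LS adaptive filter; returns the final c(n) and the a priori errors."""
    Ex, c = Ex0, 0.0 + 0.0j
    errors = []
    for xn, yn in zip(x, y):
        Ex = lam * Ex + abs(xn) ** 2          # energy update, the scalar form of (10.5.7)
        e = yn - np.conj(c) * xn              # a priori error
        c = c + (xn / Ex) * np.conj(e)        # update with time-varying gain 1 / E_x(n)
        errors.append(e)
    return c, np.array(errors)

# Hypothetical data: y(n) = c_true^* x(n), no noise.
rng = np.random.default_rng(1)
x = rng.standard_normal(200) + 1j * rng.standard_normal(200)
c_true = 0.7 - 0.4j
y = np.conj(c_true) * x
c_hat, errors = scalar_ls_filter(x, y)
```

The gain 1/E_x(n) shrinks as input energy accumulates, which is the sense in which the recursion resembles an LMS update with a time-varying step size.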


A posteriori adaptive LS algorithm. If we substitute (10.5.7) and (10.5.8) into the normal equations (10.5.2), after some simple manipulations, we obtain

$$\lambda\,\mathbf{R}(n-1)\,\mathbf{c}(n) - \mathbf{x}(n)\,\varepsilon^*(n) = \lambda\,\mathbf{d}(n-1) \tag{10.5.14}$$

where

$$\varepsilon(n) = y(n) - \mathbf{c}^H(n)\,\mathbf{x}(n) \tag{10.5.15}$$

is the a posteriori estimation error. If the matrix R(n − 1) is invertible, (10.5.14) gives

$$\mathbf{c}(n) - \lambda^{-1}\mathbf{R}^{-1}(n-1)\,\mathbf{x}(n)\,\varepsilon^*(n) = \mathbf{R}^{-1}(n-1)\,\mathbf{d}(n-1) = \mathbf{c}(n-1)$$

or

$$\mathbf{c}(n) = \mathbf{c}(n-1) + \bar{\mathbf{g}}(n)\,\varepsilon^*(n) \tag{10.5.16}$$

where

$$\lambda\,\mathbf{R}(n-1)\,\bar{\mathbf{g}}(n) \triangleq \mathbf{x}(n) \tag{10.5.17}$$

determines the alternative adaptation gain vector ḡ(n).

Since recursions (10.5.15) and (10.5.16) are coupled, the a posteriori algorithm is not applicable as it stands. However, if we substitute (10.5.16) into (10.5.15), we obtain

$$\varepsilon(n) = y(n) - [\mathbf{c}^H(n-1) + \varepsilon(n)\,\bar{\mathbf{g}}^H(n)]\,\mathbf{x}(n) = e(n) - \varepsilon(n)\,\bar{\mathbf{g}}^H(n)\,\mathbf{x}(n)$$

or

$$\varepsilon(n) = \frac{e(n)}{\bar{\alpha}(n)} \tag{10.5.18}$$

where

$$\bar{\alpha}(n) \triangleq 1 + \bar{\mathbf{g}}^H(n)\,\mathbf{x}(n) = 1 + \lambda^{-1}\mathbf{x}^H(n)\,\mathbf{R}^{-1}(n-1)\,\mathbf{x}(n) \tag{10.5.19}$$

is known as the conversion factor. Hence, we can use (10.5.19) and (10.5.18) to compute the a posteriori error ε(n) before we update the filter coefficient vector. This trick makes possible the realization and use of the a posteriori LS adaptive filter algorithm. If R(n − 1) is positive definite, we have ᾱ(n) ≥ 1 and |ε(n)| ≤ |e(n)| for all n. Therefore,

$$\sum_{n} |\varepsilon(n)|^2 \le \sum_{n} |e(n)|^2 \tag{10.5.20}$$

which should be expected† because the adaptive filter is designed by minimizing, at each time n, the total squared a posteriori error ε(n).

†The computation of the quantity Σ_{j=0}^{n} λ^(n−j) |y(j) − c^H x(j)|² for c = c(n), c(j), or c(j − 1) gives the block, a posteriori, or a priori total squared error. Clearly, only the block filter performs optimum LS filtering for all data in the interval 0 ≤ j ≤ n (see Problem 10.22).


Also, from (10.5.13), (10.5.16), and (10.5.18) we obtain

$$\mathbf{g}(n) = \frac{\bar{\mathbf{g}}(n)}{\bar{\alpha}(n)} \tag{10.5.21}$$

which shows that the two adaptation gains have the same direction but different lengths. However, from (10.5.13) and (10.5.16) we see that the corrections g(n)e*(n) and ḡ(n)ε*(n) are equal.

Another conversion factor, defined in terms of the gain vector g(n), is

$$\alpha(n) \triangleq 1 - \mathbf{x}^H(n)\,\mathbf{R}^{-1}(n)\,\mathbf{x}(n) = 1 - \mathbf{x}^H(n)\,\mathbf{g}(n) \tag{10.5.22}$$

and has some interesting interpretations. Using (10.5.21), we have

$$\alpha(n) = 1 - \frac{\mathbf{x}^H(n)\,\bar{\mathbf{g}}(n)}{\bar{\alpha}(n)}$$

$$\bar{\alpha}(n)\,\alpha(n) = \bar{\alpha}(n) + 1 - [1 + \mathbf{x}^H(n)\,\bar{\mathbf{g}}(n)] = 1$$

or

$$\alpha(n) = \frac{1}{\bar{\alpha}(n)} \tag{10.5.23}$$

which shows that the two conversion factors are inverses of each other. Since the input correlation matrix is nonnegative definite, that is, x^H(n)R⁻¹(n)x(n) ≥ 0, (10.5.22) implies

$$0 < \alpha(n) \le 1 \tag{10.5.24}$$


that is, the conversion factor α(n) is bounded by 0 and 1. This bound allows the interpretation of α(n) as an angle variable (Lee et al. 1981), and its monitoring can provide information about the proper operation of RLS algorithms. Also the quantity 1 − α(n) can be interpreted as a likelihood variable (Lee et al. 1981). It can be shown (see Problem 10.23) that

$$\alpha(n) = \lambda^{M}\,\frac{\det \mathbf{R}(n-1)}{\det \mathbf{R}(n)} \tag{10.5.25}$$

which shows the importance of α(n) or ᾱ(n) for the invertibility of the estimated correlation matrix.

The computational organization of the a priori and a posteriori LS adaptive algorithms is summarized in Table 10.5.


TABLE 10.5
Summary of a priori and a posteriori LS adaptive filter approaches.

                        A priori LS adaptive filter         A posteriori LS adaptive filter
Correlation matrix      R(n) = λR(n − 1) + x(n)x^H(n)       R(n) = λR(n − 1) + x(n)x^H(n)
Adaptation gain         R(n)g(n) = x(n)                     λR(n − 1)ḡ(n) = x(n)
A priori error          e(n) = y(n) − c^H(n − 1)x(n)        e(n) = y(n) − c^H(n − 1)x(n)
Conversion factor       α(n) = 1 − g^H(n)x(n)               ᾱ(n) = 1 + ḡ^H(n)x(n)
A posteriori error      ε(n) = α(n)e(n)                     ε(n) = e(n)/ᾱ(n)
Coefficient updating    c(n) = c(n − 1) + g(n)e*(n)         c(n) = c(n − 1) + ḡ(n)ε*(n)
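The relations among the two gains and the two conversion factors can be checked numerically. The sketch below uses a randomly generated Hermitian positive definite matrix as a stand-in for R(n − 1); all variable names and the test data are assumptions made for illustration. The factor λ^M in the determinant check is the one that follows from applying the matrix determinant lemma to the rank 1 update (10.5.7).

```python
import numpy as np

rng = np.random.default_rng(2)
M, lam = 4, 0.9

# Hypothetical R(n-1): Hermitian positive definite by construction.
A = rng.standard_normal((10, M)) + 1j * rng.standard_normal((10, M))
R_old = A.conj().T @ A
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)
R_new = lam * R_old + np.outer(x, x.conj())          # (10.5.7)

g = np.linalg.solve(R_new, x)                        # R(n) g(n) = x(n), (10.5.12)
g_bar = np.linalg.solve(lam * R_old, x)              # lam R(n-1) g_bar(n) = x(n), (10.5.17)

alpha = 1 - np.vdot(g, x).real                       # (10.5.22); g^H x is real here
alpha_bar = 1 + np.vdot(g_bar, x).real               # (10.5.19)

# Determinant form of alpha(n), including the lam^M factor.
det_ratio = lam ** M * np.linalg.det(R_old).real / np.linalg.det(R_new).real
```

With these quantities one can confirm that α(n)ᾱ(n) = 1, that g(n) = ḡ(n)/ᾱ(n), and that α(n) lies in (0, 1].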


Figure 10.33 shows a block diagram representation of the a priori LS adaptive filter. 


There are two important points to be made: 


• The adaptation gain is strictly a function of the input signal. The desired response only affects the magnitude and sign of the coefficient correction term through the error.

• The most demanding computational task in RLS filtering is the computation of the adaptation gain. This involves the solution of a linear system of equations, which requires O(M³) operations per time update.

[FIGURE 10.33: Basic elements of the a priori LS adaptive filter. The blocks are filtering, ŷ(n) = c^H(n − 1)x(n); gain vector computation, R(n)g(n) = x(n); and coefficient updating, c(n) = c(n − 1) + g(n)e*(n), with the last two forming the adaptive algorithm. Note that the filtering process has no effect on the computation of the gain vector.]


10.5.2 Conventional Recursive Least-Squares Algorithm

The major computational load in LS adaptive filters, that is, the computation of the gain vectors

$$\mathbf{g}(n) = \mathbf{R}^{-1}(n)\,\mathbf{x}(n) \tag{10.5.26}$$

or

$$\bar{\mathbf{g}}(n) = \lambda^{-1}\mathbf{R}^{-1}(n-1)\,\mathbf{x}(n) \tag{10.5.27}$$

can be reduced if we can find a recursive formula to update the inverse

$$\mathbf{P}(n) \triangleq \mathbf{R}^{-1}(n) \tag{10.5.28}$$

of the correlation matrix. We can develop such an updating by using the rank 1 updating (10.5.7) and the matrix inversion lemma

$$(\lambda\mathbf{R} + \mathbf{x}\mathbf{x}^H)^{-1} = \lambda^{-1}\mathbf{R}^{-1} - \frac{(\lambda^{-1}\mathbf{R}^{-1}\mathbf{x})(\lambda^{-1}\mathbf{R}^{-1}\mathbf{x})^H}{1 + \lambda^{-1}\mathbf{x}^H\mathbf{R}^{-1}\mathbf{x}} \tag{10.5.29}$$

discussed in Appendix A. Indeed, using (10.5.29), (10.5.7), (10.5.26), and (10.5.19), we can easily show that

$$\mathbf{P}(n) = \lambda^{-1}\mathbf{P}(n-1) - \mathbf{g}(n)\,\bar{\mathbf{g}}^H(n) \tag{10.5.30}$$

which provides the desired updating formula. Indeed, given the old matrix P(n − 1) and the new observations {x(n), y(n)}, we compute the new matrix P(n) using the following procedure

$$\begin{aligned}
\bar{\mathbf{g}}(n) &= \lambda^{-1}\mathbf{P}(n-1)\,\mathbf{x}(n) \\
\bar{\alpha}(n) &= 1 + \bar{\mathbf{g}}^H(n)\,\mathbf{x}(n) \\
\mathbf{g}(n) &= \frac{\bar{\mathbf{g}}(n)}{\bar{\alpha}(n)} \\
\mathbf{P}(n) &= \lambda^{-1}\mathbf{P}(n-1) - \mathbf{g}(n)\,\bar{\mathbf{g}}^H(n)
\end{aligned} \tag{10.5.31}$$

which is known as the conventional recursive LS (CRLS) algorithm. We again stress that the CRLS algorithm is valid for both linear combiners and FIR filters because it does not make any assumptions about the nature of the input data vector. However, for FIR filters we usually assume prewindowing, that is, x(−1) = 0, or equivalently x(n) = 0 for −M ≤ n ≤ −1.


Updating of the minimum total squared error. We next derive an update recursion for the minimum total squared error (10.5.5). Using (10.5.6), we can easily see that

$$E_y(n) = \lambda\,E_y(n-1) + y(n)\,y^*(n) \tag{10.5.32}$$

which provides a recursive updating for the energy of the desired response. Substituting (10.5.32) and (10.5.13) into (10.5.5), we obtain

$$E_{\min}(n) = \lambda E_y(n-1) + y(n)y^*(n) - \mathbf{d}^H(n)\,\mathbf{c}(n-1) - \mathbf{d}^H(n)\,\mathbf{g}(n)\,e^*(n)$$

or by using (10.5.8)

$$E_{\min}(n) = \lambda E_y(n-1) + y(n)y^*(n) - \mathbf{d}^H(n)\,\mathbf{g}(n)\,e^*(n) - y(n)\,\mathbf{x}^H(n)\,\mathbf{c}(n-1) - \lambda\,\mathbf{d}^H(n-1)\,\mathbf{c}(n-1)$$

Rearranging the terms of the last equation and using (10.5.5), we have

$$\begin{aligned}
E_{\min}(n) &= \lambda\,[E_y(n-1) - \mathbf{d}^H(n-1)\,\mathbf{c}(n-1)] + [y(n) - \mathbf{d}^H(n)\,\mathbf{g}(n)]\,e^*(n) \\
&= \lambda E_{\min}(n-1) + \{y(n) - [\mathbf{d}^H(n)\,\mathbf{R}^{-1}(n)]\,\mathbf{R}(n)\,\mathbf{g}(n)\}\,e^*(n) \\
&= \lambda E_{\min}(n-1) + [y(n) - \mathbf{c}^H(n)\,\mathbf{x}(n)]\,e^*(n)
\end{aligned}$$

where the last equation is obtained because the matrix R(n) and its inverse are Hermitian. The last equation leads to

$$\begin{aligned}
E_{\min}(n) &= \lambda E_{\min}(n-1) + \varepsilon(n)\,e^*(n) && (10.5.33) \\
&= \lambda E_{\min}(n-1) + \alpha(n)\,|e(n)|^2 && (10.5.34) \\
&= \lambda E_{\min}(n-1) + \frac{|\varepsilon(n)|^2}{\alpha(n)} && (10.5.35)
\end{aligned}$$

which provide the desired updating formulas. Since the product ε(n)e*(n) is by necessity real, we have ε(n)e*(n) = ε*(n)e(n). The value of E_min(n) increases with time and reaches a finite limit value only if λ < 1.
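As a numerical sanity check of (10.5.34), the sketch below propagates E_min(n) recursively and compares it with the direct evaluation (10.5.5) at the final time. The random test data and the small diagonal start value for R (assumed here only so that the gain can be computed from the first sample on) are illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, lam = 3, 30, 0.97
X = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))   # row n holds x(n)^T
Y = rng.standard_normal(N) + 1j * rng.standard_normal(N)             # y(n)

R = 1e-3 * np.eye(M, dtype=complex)   # assumed small diagonal start so R(n) is invertible
d = np.zeros(M, dtype=complex)
c = np.zeros(M, dtype=complex)
Ey, Emin = 0.0, 0.0

for n in range(N):
    x, y = X[n], Y[n]
    R = lam * R + np.outer(x, x.conj())       # (10.5.7)
    d = lam * d + x * np.conj(y)              # (10.5.8)
    Ey = lam * Ey + abs(y) ** 2               # (10.5.32)
    e = y - np.vdot(c, x)                     # a priori error (10.5.10)
    g = np.linalg.solve(R, x)                 # (10.5.12)
    alpha = 1 - np.vdot(g, x).real            # (10.5.22)
    c = c + g * np.conj(e)                    # (10.5.13)
    Emin = lam * Emin + alpha * abs(e) ** 2   # (10.5.34)

Emin_direct = Ey - np.vdot(d, c).real         # (10.5.5)
```

The two values agree up to rounding, and the recursive form needs only scalar operations once α(n) and e(n) are available.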


10.5.3 Some Practical Considerations 


In the practical implementation of CRLS adaptive filters, we have to deal with the issues 
of computational complexity, initialization, and finite-word-length effects. 


Computational complexity. The complete CRLS algorithm is summarized in Table 10.6. A measure of the computational complexity of the CRLS algorithm is provided by the number of operations (one operation consists of one multiplication and one addition) required to perform one updating. Since P(n) is Hermitian, it is possible to implement the algorithm so that it will require 2M² + 4M operations per time updating. The computation of ḡ_λ(n) and the updating of P(n) require O(M²) operations. In contrast, all remaining formulas, which involve dot products and vector-by-scalar multiplications, require O(M) operations. The inversion of the correlation matrix R(n) is essentially replaced by the scalar division used to compute g(n).

TABLE 10.6
Practical implementation of the RLS algorithm. To update P(n), we only compute its upper (lower) triangular part and determine the other part using Hermitian symmetry.

Initialization
    c(−1) = 0        P(−1) = δ⁻¹I
    δ = small positive constant

For each n = 0, 1, 2, ... compute:

Adaptation gain computation
    ḡ_λ(n) = P(n − 1)x(n)
    α_λ(n) = λ + ḡ_λ^H(n)x(n)
    g(n) = ḡ_λ(n)/α_λ(n)
    P(n) = λ⁻¹[P(n − 1) − g(n)ḡ_λ^H(n)]

Filtering
    e(n) = y(n) − c^H(n − 1)x(n)

Coefficient updating
    c(n) = c(n − 1) + g(n)e*(n)
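The steps of Table 10.6 can be sketched as a short routine. This is an illustrative sketch rather than a reference implementation: the function name `crls`, its default parameters, and the synthetic system identification data are assumptions, and the explicit symmetrization of P(n) after each update is one of the two options mentioned in Section 10.5.3 for preserving Hermitian symmetry.

```python
import numpy as np

def crls(X, Y, lam=0.99, delta=1e-3):
    """CRLS following Table 10.6. Row n of X holds x(n)^T; Y[n] is y(n)."""
    N, M = X.shape
    c = np.zeros(M, dtype=complex)                 # c(-1) = 0
    P = np.eye(M, dtype=complex) / delta           # P(-1) = I / delta
    e_hist = np.empty(N, dtype=complex)
    for n in range(N):
        x, y = X[n], Y[n]
        g_l = P @ x                                # g_bar_lambda(n) = P(n-1) x(n)
        a_l = lam + np.vdot(g_l, x).real           # alpha_lambda(n) = lam + g_bar_lambda^H x
        g = g_l / a_l                              # adaptation gain g(n)
        P = (P - np.outer(g, g_l.conj())) / lam    # update of the inverse correlation matrix
        P = 0.5 * (P + P.conj().T)                 # force Hermitian symmetry
        e = y - np.vdot(c, x)                      # a priori error with c(n-1)
        c = c + g * np.conj(e)                     # coefficient updating
        e_hist[n] = e
    return c, P, e_hist

# Hypothetical system identification data: y(n) = c_true^H x(n) + small noise.
rng = np.random.default_rng(5)
N, M = 300, 4
X = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
c_true = np.array([0.5, -1.0 + 0.2j, 0.3j, 0.8])
noise = 0.01 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
Y = X @ c_true.conj() + noise
c_hat, P, e_hist = crls(X, Y, lam=0.99, delta=1e-3)
```

Each iteration involves only matrix-by-vector products, one outer product, and a scalar division, so no linear system is solved inside the loop.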


Initialization. There are two ways to obtain the values P(−1) and c(−1) required to initialize the CRLS algorithm. The most obvious way is to collect an initial block of data {x(n), y(n)} for −n₀ ≤ n ≤ −1, with n₀ > M, and then compute the exact inverse matrix P(−1) and the exact LS solution c(−1).

The approach used in practice is to set P(−1) = δ⁻¹I, where δ is a very small positive number (on the order of 0.0102) and c(−1) = 0. For FIR filters this corresponds to setting x(−M + 1) = √δ and x(n) = 0 for −M + 2 ≤ n ≤ −1. For any n > M, the normal equations matrix is δλⁿI + R(n) and results in a biased estimate of c(n). However, for large n the choice of δ is unimportant because the algorithm has exponentially forgetting memory for λ < 1.

It can be shown (see Problem 10.24) that this approach provides a set of coefficients that minimizes the modified cost function

$$E(n) = \delta\lambda^{n}\,\|\mathbf{c}\|^2 + \sum_{j=0}^{n} \lambda^{n-j}\,\bigl|y(j) - \mathbf{c}^H\mathbf{x}(j)\bigr|^2 \tag{10.5.36}$$

instead of (10.5.1). This approach amounts to regularization of the LS solution (see Section 8.7.3) and is further discussed in Hubing and Alexander (1991). Note that if we turn off the input, that is, we set x(n) = 0, then (10.5.30) becomes P(n) = λ⁻¹P(n − 1), which is an unstable recursion when λ < 1.


Finite-word-length effects. There are different RLS algorithms that are algebraically equivalent; that is, they solve the same set of normal equations. Therefore, they have the same rate of convergence and the same insensitivity to variations in the eigenvalue spread of the input correlation matrix as the CRLS algorithm. All RLS algorithms are obtained by exploiting exact mathematical relations between various algorithmic quantities to obtain better computational or numerical properties. Many of these algorithmic quantities have certain physical meanings or theoretical properties. For example, in the CRLS algorithm, the matrix P(n) is Hermitian and positive definite, the angle variable satisfies 0 < α(n) ≤ 1, and the energy E(n) should always be positive. However, when we use finite precision, some of these exact relations, properties, or acceptable ranges for certain algorithmic variables may be violated.

The numerical instability of RLS algorithms can be traced to such forms of numerical inconsistencies (Verhaegen 1989; Yang and Böhme 1992; Haykin 1996). The crucial part of the CRLS algorithm is the updating of the inverse correlation matrix P(n) via (10.5.30). The CRLS algorithm becomes numerically unstable when the matrix P(n) = R⁻¹(n) loses its Hermitian symmetry or its positive definiteness (Verhaegen 1989). In practice, we can preserve the Hermitian symmetry of P(n) by computing only its lower (or upper) triangular part, using (10.5.30), and then filling the other part, using the relation p_ij(n) = p*_ji(n). Another approach is to replace P(n) by [P(n) + P^H(n)]/2 after updating from P(n − 1) to P(n).

It has been shown that the CRLS algorithm is numerically stable for λ < 1 and diverges for λ = 1 (Ljung and Ljung 1985).


10.5.4 Convergence and Performance Analysis 


The purpose of any LS adaptive filter, in a stationary SOE, is to identify the optimum filter c_o = R⁻¹d from observations of the input vector x(n) and the desired response

$$y(n) = \mathbf{c}_o^H\,\mathbf{x}(n) + e_o(n) \tag{10.5.37}$$


To simplify the analysis we adopt the independence assumptions discussed in Section 10.4.2. 
The results of the subsequent analysis hold for any LS adaptive filter implemented using the 
CRLS method or any other algebraically equivalent algorithm. We derive separate results 
for the growing memory and the fading memory (exponential forgetting) algorithms. 


Growing memory (λ = 1)


In this case all the values of the error signal, from the time the filter starts its operation 
to the present, have the same influence on the cost function. As a result, the filter loses its 
tracking ability, which is not important if the filter is used in a stationary SOE. 


Convergence in the mean. For n > M the coefficient vector c(n) is identical to the block LS solution discussed in Section 8.2.2. Therefore

$$E\{\mathbf{c}(n)\} = \mathbf{c}_o \qquad \text{for } n > M \tag{10.5.38}$$

that is, the RLS algorithm converges in the mean for n > M, where M is the number of coefficients.

Mean square deviation. For n > M we have

$$\boldsymbol{\Phi}(n) = \sigma_o^2\,E\{\mathbf{R}^{-1}(n)\} \tag{10.5.39}$$

because c(n) is an exact LS estimate (see Section 8.2.2). The correlation matrix R(n) is described by a complex Wishart distribution, and the expectation of its inverse is

$$E\{\mathbf{R}^{-1}(n)\} = \frac{1}{n-M}\,\mathbf{R}^{-1} \qquad n > M \tag{10.5.40}$$

as shown in Muirhead (1982) and Haykin (1996). Hence

$$\boldsymbol{\Phi}(n) = \frac{\sigma_o^2\,\mathbf{R}^{-1}}{n-M} \qquad n > M \tag{10.5.41}$$

and the MSD is

$$D(n) = \operatorname{tr}[\boldsymbol{\Phi}(n)] = \frac{\sigma_o^2}{n-M}\sum_{i=1}^{M}\frac{1}{\lambda_i} \qquad n > M \tag{10.5.42}$$

where λ_i, the eigenvalues of R, should not be confused with the forgetting factor λ. From (10.5.42) we conclude that (1) the MSD is magnified by the smallest eigenvalue of R and (2) the MSD decays almost linearly with time.


A priori excess MSE. We now focus on the a priori LS algorithm because it is widely used in practice and to facilitate a fairer comparison with the (a priori) LMS algorithm. To this end, we note that the a priori excess MSE formula (10.4.48)

$$P_{\mathrm{ex}}(n) = \operatorname{tr}[\mathbf{R}\,\boldsymbol{\Phi}(n-1)] \tag{10.5.43}$$

derived in Section 10.4.2, under the independence assumption, holds for any a priori adaptive algorithm. Hence, substituting (10.5.41) into (10.5.43), we obtain

$$P_{\mathrm{ex}}(n) = \frac{M\sigma_o^2}{n-M-1} \qquad n > M \tag{10.5.44}$$

which shows that P_ex(n) tends to zero as n → ∞.


Exponentially decaying memory (0 < λ < 1)

In this case the most recent values of the observations have greater influence on the formation of the LS estimate of the filter coefficients. The memory of the filter, that is, the effective number of samples used to form the various estimates, is about 1/(1 − λ) for 0.95 < λ < 1 (see Section 10.8).


Convergence in the mean. We start by multiplying both sides of (10.5.11) by R(n), and then we use (10.5.7) and (10.5.10) to obtain

$$\mathbf{R}(n)\,\mathbf{c}(n) = \lambda\,\mathbf{R}(n-1)\,\mathbf{c}(n-1) + \mathbf{x}(n)\,y^*(n) \tag{10.5.45}$$

If we multiply (10.5.7) by c_o and subtract the resulting equation from (10.5.45), we get

$$\mathbf{R}(n)\,\tilde{\mathbf{c}}(n) = \lambda\,\mathbf{R}(n-1)\,\tilde{\mathbf{c}}(n-1) + \mathbf{x}(n)\,e_o^*(n) \tag{10.5.46}$$

where c̃(n) = c(n) − c_o is the coefficient error vector. Solving (10.5.46) by recursion, we obtain

$$\tilde{\mathbf{c}}(n) = \lambda^{n}\,\mathbf{R}^{-1}(n)\,\mathbf{R}(0)\,\tilde{\mathbf{c}}(0) + \mathbf{R}^{-1}(n)\sum_{j=0}^{n}\lambda^{n-j}\,\mathbf{x}(j)\,e_o^*(j) \tag{10.5.47}$$

which depends on the initial conditions and the optimum error e_o(n). If we assume that R(n), x(j), and e_o(j) are independent and we take the expectation of (10.5.47), we obtain

$$E\{\tilde{\mathbf{c}}(n)\} = \delta\lambda^{n}\,E\{\mathbf{R}^{-1}(n)\}\,\tilde{\mathbf{c}}(0) \tag{10.5.48}$$

where, as usual, we have set R(0) = δI, δ > 0. If the matrix R(n) is positive definite and 0 < λ < 1, then the mean vector E{c̃(n)} → 0 as n → ∞. Hence, the RLS algorithm with exponential forgetting converges asymptotically in the mean to the optimum filter.


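Mean square deviation. Using (10.5.46), we obtain the following difference equation for the coefficient error vector

$$\tilde{\mathbf{c}}(n) = \lambda\,\mathbf{R}^{-1}(n)\,\mathbf{R}(n-1)\,\tilde{\mathbf{c}}(n-1) + \mathbf{R}^{-1}(n)\,\mathbf{x}(n)\,e_o^*(n)$$

or

$$\tilde{\mathbf{c}}(n) \simeq \lambda\,\tilde{\mathbf{c}}(n-1) + \mathbf{R}^{-1}(n)\,\mathbf{x}(n)\,e_o^*(n)$$

because R⁻¹(n)R(n − 1) ≃ I for large n. If we neglect the dependence among c̃(n − 1), R(n), x(n), and e_o(n), we have

$$\boldsymbol{\Phi}(n) \simeq \lambda^2\,\boldsymbol{\Phi}(n-1) + \sigma_o^2\,E\{\mathbf{R}^{-1}(n)\,\mathbf{x}(n)\mathbf{x}^H(n)\,\mathbf{R}^{-1}(n)\} \tag{10.5.49}$$

where σ_o² = E{|e_o(n)|²}.

To make the analysis mathematically tractable, we need an approximation for the inverse matrix R⁻¹(n). To this end, using (10.5.3), we have

$$E\{\mathbf{R}(n)\} = \sum_{j=0}^{n}\lambda^{n-j}\,E\{\mathbf{x}(j)\mathbf{x}^H(j)\} = \frac{1-\lambda^{n+1}}{1-\lambda}\,\mathbf{R} \simeq \frac{1}{1-\lambda}\,\mathbf{R} \tag{10.5.50}$$

where the last approximation holds for n ≫ 1. If we use the approximation E{R(n)} ≃ R(n), we obtain

$$\mathbf{R}^{-1}(n) \simeq (1-\lambda)\,\mathbf{R}^{-1} \tag{10.5.51}$$

which is more rigorously justified in Eleftheriou and Falconer (1986). Using the last approximation, (10.5.49) becomes

$$\boldsymbol{\Phi}(n) \simeq \lambda^2\,\boldsymbol{\Phi}(n-1) + (1-\lambda)^2\sigma_o^2\,\mathbf{R}^{-1} \tag{10.5.52}$$

which converges because λ² < 1. At steady state we have

$$(1-\lambda^2)\,\boldsymbol{\Phi}(\infty) \simeq (1-\lambda)^2\sigma_o^2\,\mathbf{R}^{-1}$$

because Φ(n) ≃ Φ(n − 1) for n ≫ 1. Hence

$$\boldsymbol{\Phi}(\infty) \simeq \frac{1-\lambda}{1+\lambda}\,\sigma_o^2\,\mathbf{R}^{-1} \tag{10.5.53}$$

and therefore

$$D(\infty) = \operatorname{tr}[\boldsymbol{\Phi}(\infty)] \simeq \frac{1-\lambda}{1+\lambda}\,\sigma_o^2\sum_{i=1}^{M}\frac{1}{\lambda_i} \tag{10.5.54}$$

which in contrast to (10.5.42) does not converge to zero as n → ∞. This is explained by noticing that when λ < 1, the RLS algorithm has finite memory and does not use effectively all the data to form its estimate.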


Steady-state a priori excess MSE. From (10.5.43) and (10.5.53) we obtain

$$P_{\mathrm{ex}}(\infty) = \operatorname{tr}[\mathbf{R}\,\boldsymbol{\Phi}(\infty)] \simeq \frac{1-\lambda}{1+\lambda}\,M\sigma_o^2 \tag{10.5.55}$$

which shows that as a result of finite memory, there is a steady-state excess MSE that decreases as λ approaches 1, that is, as the effective memory of the algorithm increases.
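To get a feeling for the trade-off between memory and steady-state excess MSE, the effective memory 1/(1 − λ) and the approximation (1 − λ)Mσ_o²/(1 + λ) can be tabulated for a few forgetting factors. The helper name `rls_steady_state` and the numerical values of M and σ_o² below are illustrative assumptions.

```python
def rls_steady_state(lam, M, sigma_o2):
    """Effective memory and approximate steady-state a priori excess MSE for 0 < lam < 1."""
    memory = 1.0 / (1.0 - lam)                            # effective number of samples
    excess_mse = (1.0 - lam) / (1.0 + lam) * M * sigma_o2 # steady-state excess MSE approximation
    return memory, excess_mse

# Hypothetical filter length and optimum error power.
M, sigma_o2 = 11, 1.0
for lam in (0.8, 0.95, 0.99, 0.999):
    memory, excess = rls_steady_state(lam, M, sigma_o2)
    print(f"lambda={lam}: memory ~ {memory:.1f} samples, P_ex(inf) ~ {excess:.4f}")
```

The memory figure is only a rule of thumb, and the excess MSE expression inherits the independence and large-n approximations used in its derivation.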


Summary 


The results of the above analysis are summarized in Table 10.7 for easy reference. 
We stress at this point that all RLS algorithms, independent of their implementation, have 
the same performance, assuming that we use sufficient numerical precision (e.g., double- 
precision floating-point arithmetic). Sometimes, RLS algorithms are said to have optimum 
learning because at every time instant they minimize the weighted error energy from the start 
of the operation (Tsypkin 1973). These properties are illustrated in the following example. 


TABLE 10.7
Summary of RLS and LMS performance in a stationary SOE.

Property                   Growing memory                 Exponential memory                  LMS algorithm
                           RLS algorithm                  RLS algorithm
Convergence in the mean    For all n > M                  Asymptotically for n → ∞            Asymptotically for n → ∞
Convergence in MS          Independent of the             Independent of the                  Depends on the
                           eigenvalue spread              eigenvalue spread                   eigenvalue spread
Excess MSE                 P_ex(n) = Mσ_o²/(n−M−1) → 0    P_ex(∞) ≃ (1−λ)Mσ_o²/(1+λ)          P_ex(∞) ≃ μσ_o² tr R

EXAMPLE 10.5.2. Consider the adaptive equalizer of Example 10.4.3 shown in block diagram form in Figure 10.26. In this example, we replace the LMS block in Figure 10.26 by the RLS block, and we study the performance of the RLS algorithm and compare it with that of the LMS algorithm. The input data source is a Bernoulli sequence {y(n)} with symbols +1 and −1 having zero mean and unit variance. The channel impulse response is a raised cosine

$$h(n) = \begin{cases} 0.5\left\{1 + \cos\left[\dfrac{2\pi}{W}(n-2)\right]\right\} & n = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases} \tag{10.5.56}$$

where the parameter W controls the amount of channel distortion [or the eigenvalue spread χ(R) produced by the channel]. The channel noise sequence v(n) is white Gaussian with σ_v² = 0.001. The adaptive equalizer has M = 11 coefficients, and the input signal y(n) is delayed by Δ = 7 samples. The error signal e(n) = y(n − Δ) − ŷ(n) is used along with x(n) to implement the RLS algorithm given in Table 10.6 with c(0) = 0 and δ = 0.001. We performed Monte Carlo simulations on 100 realizations of random sequences with W = 2.9 and W = 3.5, and λ = 1 and 0.8. The results are shown in Figures 10.34 and 10.35.
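The dependence of the eigenvalue spread on W can be explored with a short sketch. It assumes a unit-variance white data sequence and white noise of variance 0.001 that is independent of the data, so that the equalizer input has autocorrelation r(l) = Σ_k h(k)h(k + l) plus the noise variance at lag 0; the function names `channel` and `eigenvalue_spread` are hypothetical.

```python
import numpy as np

def channel(W):
    """Raised-cosine impulse response h(1), h(2), h(3) of (10.5.56)."""
    n = np.arange(1, 4)
    return 0.5 * (1 + np.cos(2 * np.pi / W * (n - 2)))

def eigenvalue_spread(W, M=11, sigma_v2=0.001):
    """Ratio of largest to smallest eigenvalue of the M x M input correlation matrix."""
    h = channel(W)
    r = np.zeros(M)
    r[0] = h @ h + sigma_v2                  # lag 0, including the noise variance
    r[1] = h[0] * h[1] + h[1] * h[2]         # lag 1
    r[2] = h[0] * h[2]                       # lag 2; higher lags are zero
    lags = np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
    R = r[lags]                              # symmetric Toeplitz correlation matrix
    eig = np.linalg.eigvalsh(R)              # eigenvalues in ascending order
    return eig[-1] / eig[0]

spread_29 = eigenvalue_spread(2.9)
spread_35 = eigenvalue_spread(3.5)
```

Larger W spreads more energy into the outer taps h(1) and h(3), which makes the input correlation matrix more ill conditioned.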


Effect of eigenvalue spread. Performance plots of the RLS algorithm for W = 2.9 and W = 3.5 are shown in Figure 10.34. In plot (a) we depict MSE learning curves along with the steady-state (or minimum) error. We observe that the MSE convergence rate of the RLS, unlike that for the LMS, does not change with W [or equivalently with change in χ(R)]. The steady-state error, on the other hand, increases with W. The important difference between the two algorithms is that the convergence rate is faster for the RLS (compare Figures 10.34 and 10.27). Clearly, this faster convergence of the RLS algorithm is achieved by an increase in computational complexity. In plots (b) and (c) we show the ensemble averaged equalizer coefficients. Clearly, the responses are symmetric with respect to n = 5, as assumed. Also equalizer coefficients converge to different inverses due to changes in the channel characteristics.

Effect of forgetting factor λ. In Figure 10.35 we show the MSE learning curves obtained for W = 2.9 and with two different forgetting factors, 1 and 0.8. For λ = 1, as explained before, the algorithm has infinite memory and hence the steady-state excess MSE is zero. This fact can be verified in the plot for λ = 1 in which the MSE converges to the minimum error. For λ = 0.8, the effective memory is 1/(1 − λ) = 5, which clearly is inadequate for the accurate estimation of the required statistics, resulting in increased excess MSE. Therefore, the algorithm should produce a nonzero excess MSE. This fact can be observed from the plot for λ = 0.8.

There are two practical issues regarding the RLS algorithm that need an explanation. The first issue relates to the practical value of λ. Although λ can take any value in the interval 0 < λ ≤ 1, since it influences the effective memory size, the value of λ should be close to 1. This value is determined by the number of parameters to be estimated and the desired size of the effective memory. Typical values used are between 0.99 and 1 (not 0.8, as we used in this example for demonstration). The second issue deals with the actual computation of matrix P(n). This matrix must be conjugate symmetric and positive definite. However, an implementation of the CRLS algorithm of Table 10.6 on a finite-precision processor will eventually disturb this symmetry and positive definiteness and would result in an unstable performance. Therefore, it is necessary to force this symmetry either by computing only its lower (or upper) triangular values or by using P(n) ← [P(n) + P^H(n)]/2. Failure to do so generally affects the algorithm performance for λ < 1.

[Figure caption: Performance analysis curves of the RLS algorithm in the adaptive equalizer: λ = 1.]

10.6 RLS ALGORITHMS FOR ARRAY PROCESSING


In this section we show how to develop algorithms for RLS array processing using the QR decomposition. The obtained algorithms (1) are algebraically equivalent to the CRLS algorithm, (2) have very good numerical properties, and (3) are modular and can be implemented using parallel processing. Since there are no restrictions on the input data vector, the algorithms require O(M²) operations per time update and can be used for both array processing and FIR filtering applications. The method of choice for applications that only require the a priori error e(n) or the a posteriori error ε(n) is the QR-RLS algorithm using the Givens rotations. For applications that require the coefficient vector c(n), the Givens rotations-based inverse QR-RLS algorithm is preferred. In Section 10.7 we develop fast algorithms for FIR filters, with a complexity of O(M) operations per time update, by exploiting the shift invariance of the input data vector.


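10.6.1 LS Computations Using the Cholesky and QR Decompositions

We start by reformulating the exponentially weighted LS filtering problem in terms of data matrices, as discussed in Section 8.2. If c(n) is the LS filter coefficient vector at time instant n, we have

$$\varepsilon(j) = y(j) - \mathbf{c}^H(n)\,\mathbf{x}(j) \qquad 0 \le j \le n \tag{10.6.1}$$

where

$$\mathbf{x}(j) = [x_1(j)\;\; x_2(j)\;\; \cdots\;\; x_M(j)]^T \tag{10.6.2}$$

for array processing and

$$\mathbf{x}(j) = [x(j)\;\; x(j-1)\;\; \cdots\;\; x(j-M+1)]^T \tag{10.6.3}$$

for FIR filtering. We stress that c(n) should be held constant during the optimization interval 0 ≤ j ≤ n. Using the (n + 1) × M data matrix

$$\mathbf{X}^H(n) = [\mathbf{x}(0)\;\; \mathbf{x}(1)\;\; \cdots\;\; \mathbf{x}(n)] = \begin{bmatrix} x_1(0) & x_1(1) & \cdots & x_1(n) \\ x_2(0) & x_2(1) & \cdots & x_2(n) \\ \vdots & \vdots & & \vdots \\ x_M(0) & x_M(1) & \cdots & x_M(n) \end{bmatrix} \tag{10.6.4}$$

the (n + 1) × 1 desired response vector

$$\mathbf{y}(n) = [y(0)\;\; y(1)\;\; \cdots\;\; y(n)]^H \tag{10.6.5}$$

and the (n + 1) × 1 a posteriori error vector

$$\boldsymbol{\varepsilon}(n) = [\varepsilon(0)\;\; \varepsilon(1)\;\; \cdots\;\; \varepsilon(n)]^H \tag{10.6.6}$$

we can combine the n + 1 equations (10.6.1) in a single equation as

$$\boldsymbol{\varepsilon}(n) = \mathbf{y}(n) - \mathbf{X}(n)\,\mathbf{c}(n) \tag{10.6.7}$$

If we define the (n + 1) × (n + 1) exponential weighting matrix

$$\boldsymbol{\Lambda}^2(n) \triangleq \operatorname{diag}\{\lambda^{n}, \lambda^{n-1}, \ldots, 1\} \tag{10.6.8}$$

we can express the total squared error (10.5.1) and the normal equations (10.5.2) in the form required to apply orthogonal decomposition techniques (see Section 8.6). Indeed, we can easily see that the total squared error can be written as

$$E(n) = \sum_{j=0}^{n}\lambda^{n-j}\,|\varepsilon(j)|^2 = \|\boldsymbol{\Lambda}(n)\,\boldsymbol{\varepsilon}(n)\|^2 \tag{10.6.9}$$

and the LS filter coefficients are determined by the normal equations

$$\mathbf{R}(n)\,\mathbf{c}(n) = \mathbf{d}(n) \tag{10.6.10}$$

where

$$\mathbf{R}(n) = \sum_{j=0}^{n}\lambda^{n-j}\,\mathbf{x}(j)\mathbf{x}^H(j) = [\boldsymbol{\Lambda}(n)\mathbf{X}(n)]^H[\boldsymbol{\Lambda}(n)\mathbf{X}(n)] \tag{10.6.11}$$

and

$$\mathbf{d}(n) = \sum_{j=0}^{n}\lambda^{n-j}\,\mathbf{x}(j)\,y^*(j) = [\boldsymbol{\Lambda}(n)\mathbf{X}(n)]^H[\boldsymbol{\Lambda}(n)\mathbf{y}(n)] \tag{10.6.12}$$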


are expressed as a function of the weighted data matrix [Λ(n)X(n)] and the weighted desired response vector [Λ(n)y(n)].

In Chapter 6 we discussed how to solve the normal equations (10.6.10) by using either the Cholesky decomposition

$$\mathbf{R}(n) = \mathcal{L}(n)\,\mathcal{L}^H(n) \tag{10.6.13}$$

or the LDU decomposition

$$\mathbf{R}(n) = \mathbf{L}(n)\,\mathbf{D}(n)\,\mathbf{L}^H(n) \tag{10.6.14}$$

where ℒ(n) = L(n)D^(1/2)(n).

The Cholesky factor ℒ(n) can be computed either from matrix R(n) using the Cholesky decomposition algorithm (see Section 6.3) or from data matrix [Λ(n)X(n)] using one of the QR decomposition methods (Givens, Householder, or MGS) discussed in Chapter 8.

Suppose now that the QR decomposition is†

$$\mathbf{Q}(n)\,[\boldsymbol{\Lambda}(n)\mathbf{X}(n)] = \begin{bmatrix} \tilde{\mathbf{R}}(n) \\ \mathbf{0} \end{bmatrix} \tag{10.6.15}$$

where R̃(n) is a unique upper triangular matrix with positive diagonal elements and Q(n) is a unitary matrix. From (10.6.11) and (10.6.15) we have

$$\mathbf{R}(n) = \tilde{\mathbf{R}}^H(n)\,\tilde{\mathbf{R}}(n) \tag{10.6.16}$$

which implies, owing to the uniqueness of the Cholesky decomposition, that ℒ(n) = R̃^H(n). Although the two approaches are algebraically equivalent, the QR decomposition (QRD) methods have superior numerical properties because they avoid the squaring operation (10.6.11) (see Section 8.6).

Given the Cholesky factor R̃(n), we first solve the lower triangular system

$$\tilde{\mathbf{R}}^H(n)\,\mathbf{k}(n) \triangleq \mathbf{d}(n) \tag{10.6.17}$$

to obtain the partial correlation vector k(n), using forward elimination. In the case of QRD the vector k(n) is obtained by transforming Λ(n)y(n) and retaining its first M components, that is,

$$\mathbf{Q}(n)\,[\boldsymbol{\Lambda}(n)\mathbf{y}(n)] \triangleq \mathbf{z}(n) = \begin{bmatrix} \mathbf{k}(n) \\ \mathbf{z}_2(n) \end{bmatrix} \tag{10.6.18}$$

where k(n) consists of the first M components of z(n) (see Section 8.6). The minimum LSE is given by

$$E(n) = E_y(n) - \mathbf{d}^H(n)\,\mathbf{c}(n) = \|\boldsymbol{\Lambda}(n)\mathbf{y}(n)\|^2 - \|\mathbf{k}(n)\|^2 \tag{10.6.19}$$

which was also proved in Section 8.6.
To compute the filter parameters, we can solve the upper triangular system

$$\tilde{\mathbf{R}}(n)\,\mathbf{c}(n) = \mathbf{k}(n) \tag{10.6.20}$$

by backward elimination. As we discussed in Section 6.3, the solution of (10.6.20) is not order-recursive.

†To comply with adaptive filtering literature, we express the QR decomposition as QX = R instead of Q^H X = R, which we used in Chapter 8 and is widely used in numerical analysis.

In applications that only require the a posteriori or a priori errors, we can avoid the solution of (10.6.20). Indeed, if we define the LS innovations vector w(n) by

$$\tilde{\mathbf{R}}^H(n)\,\mathbf{w}(n) \triangleq \mathbf{x}(n) \tag{10.6.21}$$

we obtain

$$\varepsilon(n) = y(n) - \mathbf{c}^H(n)\,\mathbf{x}(n) = y(n) - \mathbf{k}^H(n)\,\mathbf{w}(n) \tag{10.6.22}$$

and

$$e(n) = y(n) - \mathbf{c}^H(n-1)\,\mathbf{x}(n) = y(n) - \mathbf{k}^H(n-1)\,\mathbf{w}(n) \tag{10.6.23}$$

which can be used to compute the errors without knowledge of the parameter vector c(n). Furthermore, since the lower triangular systems (10.6.17) and (10.6.21) satisfy the optimum nesting property, we can compute both errors in an order-recursive manner.

If we know the factors L(n) and D(n) of R(n) at each time instant n, we can use the orthogonal triangular structure shown in Figure 7.1 (see Sections 7.1.5 and 8.5) to compute all e_m(n) and ε_m(n) for all 1 ≤ m ≤ M. A similar structure can be obtained by using the Cholesky factor ℒ(n) (see Problem 10.26).

From the discussion in Section 10.5.1 we saw that the key part of the CRLS algorithm is the computation of the gain vector

$$\mathbf{R}(n)\,\mathbf{g}(n) = \mathbf{x}(n) \tag{10.6.24}$$

or the alternative gain vector λR(n − 1)ḡ(n) = x(n). Using (10.6.16), (10.6.21), and (10.6.24), we obtain

$$\tilde{\mathbf{R}}(n)\,\mathbf{g}(n) = \mathbf{w}(n) \tag{10.6.25}$$

which expresses the gain vector in terms of the Cholesky factor R̃(n) and the innovations vector w(n). Similarly with (10.6.20), (10.6.25) lacks the optimum nesting property that is required to obtain an order-recursive algorithm.

To summarize, if we can update the Cholesky factors of either R(n) or R⁻¹(n), we can develop exact RLS algorithms that provide both the filtering errors and the coefficient vector or the filtering error only. The relevant relations are shown in Table 10.8. We stress that the Cholesky decomposition method determines the factors R̃(n) and k(n) by factoring the matrix

$$[\mathbf{X}(n)\;\; \mathbf{y}(n)]^H\,\boldsymbol{\Lambda}^2(n)\,[\mathbf{X}(n)\;\; \mathbf{y}(n)]$$

whereas the QRD methods factor the data matrix Λ(n)[X(n) y(n)]. Since all these algorithms propagate the square roots R̃(n) or R̃⁻¹(n), the matrices determined by R(n) = R̃^H(n)R̃(n) and R⁻¹(n) = R̃⁻¹(n)R̃^(−H)(n) are guaranteed to be Hermitian and are more likely to preserve their positive definiteness. Hence, such algorithms have better numerical properties than the CRLS method.
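The QRD route of (10.6.15) to (10.6.20) can be compared with the normal equations in a few lines. The sketch below is illustrative only: the random data and variable names are assumptions, and it relies on `numpy.linalg.qr`, which returns factors in the numerical-analysis convention A = QR, so the Q(n) of the text corresponds to the conjugate transpose of the NumPy factor. NumPy also does not force a positive diagonal on the triangular factor, which does not affect the solution.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, lam = 25, 3, 0.95
xs = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))   # row j holds x(j)^T
ys = rng.standard_normal(N) + 1j * rng.standard_normal(N)             # y(j)

Lam = np.sqrt(lam ** (N - 1 - np.arange(N)))   # diagonal of Lambda(n), with n = N - 1
A = Lam[:, None] * xs.conj()                   # Lambda(n) X(n): row j is weighted x^H(j)
b = Lam * ys.conj()                            # Lambda(n) y(n): entries are weighted y*(j)

Q1, R_t = np.linalg.qr(A)                      # reduced QR: A = Q1 R_t, R_t upper triangular
k = Q1.conj().T @ b                            # first M components of the rotated response
c_qr = np.linalg.solve(R_t, k)                 # (10.6.20); a triangular solver would also do

R_hat = A.conj().T @ A                         # correlation matrix (10.6.11)
d_hat = A.conj().T @ b                         # cross-correlation vector (10.6.12)
c_ne = np.linalg.solve(R_hat, d_hat)           # normal equations (10.6.10)

E_qr = np.linalg.norm(b) ** 2 - np.linalg.norm(k) ** 2   # minimum LSE as in (10.6.19)
E_direct = np.linalg.norm(b - A @ c_qr) ** 2             # weighted error energy (10.6.9)
```

Both routes give the same coefficients in exact arithmetic; the QRD route never forms the product in (10.6.11), which is the source of its better numerical behavior.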


TABLE 10.8
Triangular decomposition RLS algorithms using coefficient updating and direct error extraction.

Error and coefficients updating        Error-only updating
R̃^H(n)w(n) = x(n)                      R̃^H(n)k(n) = d(n)
R̃(n)g(n) = w(n)                        R̃^H(n)w(n) = x(n)
e(n) = y(n) − c^H(n − 1)x(n)           e(n) = y(n) − k^H(n − 1)w(n)
c(n) = c(n − 1) + g(n)e*(n)


10.6.2 Two Useful Lemmas 


We next prove two lemmas that are very useful in the development of RLS algorithms using 
QRD methods. We start with the first lemma, which stems from the algebraic equivalence 
between the Cholesky and QR decompositions. 


LEMMA 10.1. Computing the QRD of the $(n+1) \times M$ data matrix $\Lambda(n)X(n)$ is equivalent to
evaluating the QRD of the $(M+1) \times M$ matrix

$$\begin{bmatrix} \sqrt{\lambda}\,\tilde{R}(n-1) \\ x^H(n) \end{bmatrix}$$

Proof. Indeed, if we express $\Lambda(n)X(n)$ as

$$\Lambda(n)X(n) = \begin{bmatrix} \sqrt{\lambda}\,\Lambda(n-1)X(n-1) \\ x^H(n) \end{bmatrix} \tag{10.6.26}$$

and define a matrix

$$\bar{Q}(n-1) \triangleq \begin{bmatrix} Q(n-1) & 0 \\ 0^H & 1 \end{bmatrix} \tag{10.6.27}$$

we obtain

$$\bar{Q}(n-1)\Lambda(n)X(n) = \begin{bmatrix} \sqrt{\lambda}\,\tilde{R}(n-1) \\ 0 \\ x^H(n) \end{bmatrix} \tag{10.6.28}$$

by using (10.6.15). If we can construct a matrix $\hat{Q}(n)$ that performs the QRD of the right-hand
side of (10.6.28), then the unitary matrix $Q(n) \triangleq \hat{Q}(n)\bar{Q}(n-1)$ performs the QRD of
$\Lambda(n)X(n)$. Since the block of zeros in (10.6.28) has no effect on the construction of matrix $\hat{Q}(n)$,
the construction of $\hat{Q}(n)$ is equivalent to finding a unitary matrix that performs the QRD of

$$\begin{bmatrix} \sqrt{\lambda}\,\tilde{R}(n-1) \\ x^H(n) \end{bmatrix}$$

The second lemma, known as the matrix factorization lemma (Golub and Van Loan
1996; Sayed and Kailath 1994), provides an elegant tool for the derivation of QRD-based
RLS algorithms.


LEMMA 10.2. If $A$ and $B$ are any two $N \times M$ ($N \le M$) matrices, then

$$A^H A = B^H B \tag{10.6.29}$$

if and only if there exists an $N \times N$ unitary matrix $Q$ ($Q^H Q = I$) such that

$$QA = B \tag{10.6.30}$$

Proof. From (10.6.30) we have $B^H B = A^H Q^H Q A = A^H A$, which proves (10.6.29). To
prove the converse, we use the singular value decomposition (SVD) of matrices $A$ and $B$

$$A = U_A \Sigma_A V_A^H \tag{10.6.31}$$
$$B = U_B \Sigma_B V_B^H \tag{10.6.32}$$

where $U_A$ and $U_B$ are $N \times N$ unitary matrices, $V_A$ and $V_B$ are $M \times M$ unitary matrices, and
$\Sigma_A$ and $\Sigma_B$ are $N \times M$ matrices consisting of the nonnegative singular values of $A$ and $B$. Using
(10.6.29) in conjunction with (10.6.31) and (10.6.32), we obtain

$$V_A = V_B \tag{10.6.33}$$

and

$$\Sigma_A = \Sigma_B \tag{10.6.34}$$

If we now define the matrix

$$Q \triangleq U_B U_A^H$$

and use (10.6.33) and (10.6.34), we have

$$QA = U_B U_A^H U_A \Sigma_A V_A^H = U_B \Sigma_B V_B^H = B$$

which proves the converse of the lemma.
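The direct part of Lemma 10.2 is easy to check numerically. The following is a minimal sketch, assuming real-valued data so that the Hermitian transpose reduces to an ordinary transpose; the matrix `A`, the angle `theta`, and the helper names `matmul` and `transpose` are hypothetical choices made only for this illustration.

```python
import math

def matmul(P, Q):
    """Multiply two real matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*P)]

# Hypothetical 2 x 3 example (N = 2, M = 3).
A = [[1.0, 2.0, 0.5],
     [3.0, -1.0, 2.0]]

theta = 0.7  # arbitrary rotation angle; Q is a 2 x 2 orthogonal matrix
Q = [[math.cos(theta), math.sin(theta)],
     [-math.sin(theta), math.cos(theta)]]

B = matmul(Q, A)                      # B = Q A
gram_A = matmul(transpose(A), A)      # A^T A
gram_B = matmul(transpose(B), B)      # B^T B

# Largest elementwise difference between the two Gram matrices.
max_diff = max(abs(gram_A[i][j] - gram_B[i][j])
               for i in range(3) for j in range(3))
print(max_diff)
```

The printed difference should be at rounding-error level. The sketch exercises only the implication from (10.6.30) to (10.6.29); the converse relies on the SVD argument given in the proof.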
10.6.3 The QR-RLS Algorithm


We next show how to update the factors $\tilde{R}(n)$ and $\tilde{k}(n)$ of the extended data matrix
$\Lambda(n)[X(n)\ \ y(n)]$ and then compute the a priori error $e(n)$ or the a posteriori error $\varepsilon(n)$.
The findings hold independently of the method we use to construct the orthogonalizing
matrix $Q(n)$.

Suppose now that at time $n$ we know the old Cholesky factors $\tilde{R}(n-1)$ and $\tilde{k}(n-1)$,
we receive the new data $\{x(n), y(n)\}$, and we wish to determine the new factors $\tilde{R}(n)$ and
$\tilde{k}(n)$ without repeating all the work. To this end, we show that if there exists a unitary matrix
$Q(n)$ that annihilates the vector $x^H(n)$ from the last row of the left-hand side matrix in the
relation

$$Q(n)\begin{bmatrix} \sqrt{\lambda}\,\tilde{R}(n-1) & \sqrt{\lambda}\,\tilde{k}(n-1) & 0 \\ x^H(n) & y^*(n) & 1 \end{bmatrix} = \begin{bmatrix} \tilde{R}(n) & \tilde{k}(n) & \tilde{w}(n) \\ 0^H & \tilde{e}^*(n) & \tilde{\alpha}(n) \end{bmatrix} \tag{10.6.35}$$

then the right-hand side matrix provides the required updates and errors. The scalar $\tilde{\alpha}(n)$ is
real-valued because it is equal to the last diagonal element of $Q(n)$. The meaning and use
of $\tilde{\alpha}(n)$ and $\tilde{w}(n)$, which comprise the last column of $Q(n)$, will be explained in the sequel.
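If we apply Lemma 10.2 with

$$A = \begin{bmatrix} \sqrt{\lambda}\,\tilde{R}(n-1) & \sqrt{\lambda}\,\tilde{k}(n-1) & 0 \\ x^H(n) & y^*(n) & 1 \end{bmatrix} \quad\text{and}\quad B = \begin{bmatrix} \tilde{R}(n) & \tilde{k}(n) & \tilde{w}(n) \\ 0^H & \tilde{e}^*(n) & \tilde{\alpha}(n) \end{bmatrix}$$

we obtain, with $(\cdot)_{ij}$ denoting the $ij$th element of a block matrix,

$$(B^H B)_{11} = \tilde{R}^H(n)\tilde{R}(n) = \lambda\tilde{R}^H(n-1)\tilde{R}(n-1) + x(n)x^H(n) = (A^H A)_{11} \tag{10.6.36}$$
$$(B^H B)_{12} = \tilde{R}^H(n)\tilde{k}(n) = \lambda\tilde{R}^H(n-1)\tilde{k}(n-1) + x(n)y^*(n) = (A^H A)_{12} \tag{10.6.37}$$
$$(B^H B)_{13} = \tilde{R}^H(n)\tilde{w}(n) = x(n) = (A^H A)_{13} \tag{10.6.38}$$
$$(B^H B)_{23} = \tilde{k}^H(n)\tilde{w}(n) + \tilde{e}(n)\tilde{\alpha}(n) = y(n) = (A^H A)_{23} \tag{10.6.39}$$
$$(B^H B)_{33} = \tilde{w}^H(n)\tilde{w}(n) + \tilde{\alpha}^2(n) = 1 = (A^H A)_{33} \tag{10.6.40}$$

We first note that (10.6.36) is identical to the time updating (10.5.7) of the correlation
matrix. Hence, $\tilde{R}(n)$ is the Cholesky factor of $\hat{R}(n)$. Also (10.6.37) is identical, due to
(10.6.17), to the time updating (10.5.8) of the cross-correlation vector $\hat{d}(n)$, and (10.6.38)
is the definition (10.6.21) of the innovations vector. To uncover the physical meaning of
$\tilde{e}(n)$ and $\tilde{\alpha}(n)$, we note that comparing (10.6.39) to (10.6.22) gives

$$\varepsilon(n) = \tilde{e}(n)\tilde{\alpha}(n) \tag{10.6.41}$$

which shows that $\tilde{e}(n)$ is a scaled version of the a posteriori error. Starting with (10.6.40)
and using (10.6.20), (10.6.16), and (10.5.22), we obtain

$$\tilde{\alpha}^2(n) = 1 - \tilde{w}^H(n)\tilde{w}(n) = 1 - x^H(n)\hat{R}^{-1}(n)x(n) = \alpha(n) \tag{10.6.42}$$

or

$$\tilde{\alpha}(n) = \sqrt{\alpha(n)} \tag{10.6.43}$$

which shows that $\tilde{\alpha}(n)$ is a normalized conversion factor. Since

$$\varepsilon(n) = \alpha(n)e(n) = \tilde{\alpha}^2(n)e(n) \tag{10.6.44}$$

using (10.6.41), we obtain

$$\tilde{e}(n) = \sqrt{e(n)\varepsilon(n)} \tag{10.6.45}$$

that is, $\tilde{e}(n)$ is the geometric mean of the a priori and a posteriori LS errors. Furthermore,
(10.6.41) and (10.6.44) give

$$e(n) = \frac{\tilde{e}(n)}{\tilde{\alpha}(n)} \tag{10.6.46}$$

which also can be proved from (10.6.35) directly (see Problem 10.45).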

In summary, to determine the updates of the Cholesky factors $\tilde{R}(n)$ and $\tilde{k}(n)$ and the
a priori error $e(n)$, we simply need to determine a unitary matrix $Q(n)$ that annihilates the
vector $x^H(n)$ in (10.6.35). The construction of the matrix $Q(n)$ is discussed later in Section
10.6.6.


10.6.4 Extended QR-RLS Algorithm 


In applications that require the coefficient vector, we need to solve the upper triangular
system $\tilde{R}(n)c(n) = \tilde{k}(n)$ by back substitution. This method is not order-recursive and
cannot be implemented in parallel. An alternative approach can be chosen by appending
one more column to the matrices of the QR algorithm (10.6.35). To simplify the derivation,
we combine the first column of (10.6.35) and the new column to construct the formula

$$Q(n)\begin{bmatrix} \sqrt{\lambda}\,\tilde{R}(n-1) & \tilde{R}^{-H}(n-1)/\sqrt{\lambda} \\ x^H(n) & 0^H \end{bmatrix} = \begin{bmatrix} \tilde{R}(n) & D(n) \\ 0^H & -\tilde{g}^H(n) \end{bmatrix} \tag{10.6.47}$$

where $D(n)$ and $\tilde{g}(n)$ are yet to be determined. Using Lemma 10.2, we obtain

$$(B^H B)_{12} = \tilde{R}^H(n)D(n) = I = (A^H A)_{12} \tag{10.6.48}$$

which implies that $D(n) = \tilde{R}^{-H}(n)$ is the Cholesky factor of $\hat{R}^{-1}(n)$ and can be updated
by using the same orthogonal transformation $Q(n)$. Furthermore, we have

$$(B^H B)_{22} = \tilde{R}^{-1}(n)\tilde{R}^{-H}(n) + \tilde{g}(n)\tilde{g}^H(n) = \lambda^{-1}\tilde{R}^{-1}(n-1)\tilde{R}^{-H}(n-1) = (A^H A)_{22}$$

which, using (10.6.16), gives

$$\hat{R}^{-1}(n) = P(n) = \lambda^{-1}P(n-1) - \tilde{g}(n)\tilde{g}^H(n)$$

Comparing the last equation to (10.5.30) gives

$$\tilde{g}(n) = \frac{g(n)}{\sqrt{\alpha(n)}} = \frac{g(n)}{\tilde{\alpha}(n)} \tag{10.6.49}$$

that is, $\tilde{g}(n)$ is a scaled version of the RLS gain vector. Using (10.5.13) gives

$$c(n) = c(n-1) + \tilde{g}(n)\tilde{e}^*(n) \tag{10.6.50}$$

which provides a time-updating formula for the filter coefficient vector. This method of
updating the coefficient vector $c(n)$ is known as the extended QR-RLS algorithm (Yang
and Böhme 1992; Sayed and Kailath 1994). This algorithm is not widely used because
the propagation of both $\tilde{R}(n)$ and $\tilde{R}^{-H}(n)$ may lead to numerical problems, especially
in finite-precision implementations. This problem may be avoided by using the inverse
QR-RLS algorithm, discussed next. Other methods of extracting the coefficient vector are
discussed in Shepherd and McWhirter (1993).
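$$1 + \frac{1}{\lambda}x^H(n)P(n-1)x(n) = \frac{1}{\alpha(n)} \tag{10.6.51}$$
$$\frac{1}{\lambda}P(n-1)x(n) = \frac{g(n)}{\alpha(n)} \tag{10.6.52}$$
$$\frac{1}{\lambda}P(n-1) = P(n) + \frac{g(n)}{\sqrt{\alpha(n)}}\,\frac{g^H(n)}{\sqrt{\alpha(n)}} \tag{10.6.53}$$

which combined with the Cholesky decomposition

$$P(n) = \hat{R}^{-1}(n) = \tilde{R}^{-1}(n)\tilde{R}^{-H}(n) \tag{10.6.54}$$

leads to the following identity

$$\begin{bmatrix} \frac{1}{\sqrt{\lambda}}x^H(n)\tilde{R}^{-1}(n-1) & 1 \\ \frac{1}{\sqrt{\lambda}}\tilde{R}^{-1}(n-1) & 0 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{\lambda}}\tilde{R}^{-H}(n-1)x(n) & \frac{1}{\sqrt{\lambda}}\tilde{R}^{-H}(n-1) \\ 1 & 0^H \end{bmatrix} = \begin{bmatrix} 0^H & \frac{1}{\sqrt{\alpha(n)}} \\ \tilde{R}^{-1}(n) & \frac{g(n)}{\sqrt{\alpha(n)}} \end{bmatrix} \begin{bmatrix} 0 & \tilde{R}^{-H}(n) \\ \frac{1}{\sqrt{\alpha(n)}} & \frac{g^H(n)}{\sqrt{\alpha(n)}} \end{bmatrix} \tag{10.6.55}$$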
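where $\tilde{R}(n)$ is an upper triangular matrix. From (10.6.55) and Lemma 10.2 there is a
unitary matrix $Q(n)$ such that

$$Q(n)\begin{bmatrix} \frac{1}{\sqrt{\lambda}}\tilde{R}^{-H}(n-1)x(n) & \frac{1}{\sqrt{\lambda}}\tilde{R}^{-H}(n-1) \\ 1 & 0^H \end{bmatrix} = \begin{bmatrix} 0 & \tilde{R}^{-H}(n) \\ \frac{1}{\sqrt{\alpha(n)}} & \frac{g^H(n)}{\sqrt{\alpha(n)}} \end{bmatrix} \tag{10.6.56}$$

This shows that annihilating the vector $\tilde{R}^{-H}(n-1)x(n)/\sqrt{\lambda} = w(n)/\sqrt{\lambda}$ updates the
Cholesky factor $\tilde{R}^{-H}(n)$, the normalized gain vector $\tilde{g}(n)$, and the conversion factor $\tilde{\alpha}(n)$.
Again, the only requirement of matrix $Q(n)$ is to annihilate the vector $w(n)/\sqrt{\lambda}$. This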
algorithm, like the CRLS method, is initialized by setting $\tilde{R}^{-H}(-1) = \delta^{-1}I$,
where $\delta$ is a very small positive number.


10.6.6 Implementation of QR-RLS Algorithm Using the Givens Rotations 


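To develop a complete QRD-based RLS algorithm, we need to construct the matrix $Q(n)$
that annihilates the vector $x^H(n)$ on the left-hand side of (10.6.35). Since we do not need
the vector $\tilde{w}(n)$ and we can compute $\tilde{\alpha}(n)$ from matrix $Q(n)$, as we shall see later, we work
with the following part

$$Q(n)\underbrace{\begin{bmatrix} \sqrt{\lambda}\,\tilde{R}(n-1) & \sqrt{\lambda}\,\tilde{k}(n-1) \\ x^H(n) & y^*(n) \end{bmatrix}}_{\bar{R}(n)} = \begin{bmatrix} \tilde{R}(n) & \tilde{k}(n) \\ 0^H & \tilde{e}^*(n) \end{bmatrix} \tag{10.6.57}$$

and show how to annihilate the elements of $x^H(n)$, one by one, using a sequence of $M$
Givens rotations. We remind the reader that the matrix $\tilde{R}(n-1)$ is upper triangular. We
start by constructing a Givens rotation matrix $G^{(1)}(n)$ that operates on the first and last rows
of $\bar{R}(n)$ to annihilate the first element of $x^H(n)$. More specifically, we wish to find a Givens
rotation such that

$$\begin{bmatrix} c_1 & 0^H & s_1^* \\ 0 & I & 0 \\ -s_1 & 0^H & c_1 \end{bmatrix} \begin{bmatrix} \sqrt{\lambda}\,\tilde{r}_{11}(n-1) & \sqrt{\lambda}\,\tilde{r}_{12}(n-1) & \cdots & \sqrt{\lambda}\,\tilde{r}_{1M}(n-1) & \sqrt{\lambda}\,\tilde{k}_1(n-1) \\ \vdots & \vdots & & \vdots & \vdots \\ x_1^*(n) & x_2^*(n) & \cdots & x_M^*(n) & y^*(n) \end{bmatrix} = \begin{bmatrix} \tilde{r}_{11}(n) & \tilde{r}_{12}(n) & \cdots & \tilde{r}_{1M}(n) & \tilde{k}_1(n) \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & x_2^{(2)}(n) & \cdots & x_M^{(2)}(n) & y^{(2)}(n) \end{bmatrix}$$

To this end, we use the first element of the first row and the first element of the last row to
determine the rotation parameters $c_1$ and $s_1$, and then we apply the rotation to the remaining
$M$ pairs of the two rows. Note that for consistency of notation we define $x_k^{(1)}(n) \triangleq x_k^*(n)$
and $y^{(1)}(n) \triangleq y^*(n)$. Then using $\sqrt{\lambda}\,\tilde{r}_{22}(n-1)$ and $x_2^{(2)}(n)$, we determine $G^{(2)}(n)$ and annihilate
the second element of the last row by operating on the $M-1$ pairs of the second row and
the last row of the matrix $G^{(1)}(n)\bar{R}(n)$.

In general, we use the elements $\sqrt{\lambda}\,\tilde{r}_{ii}(n-1)$ and $x_i^{(i)}(n)$ to determine the Givens
rotation matrix $G^{(i)}(n)$ that operates on the $i$th row and the last row of the rotated matrix
$G^{(i-1)}(n)\cdots G^{(1)}(n)\bar{R}(n)$ to annihilate the element $x_i^{(i)}(n)$. Therefore,

$$\begin{bmatrix} c_i & s_i^* \\ -s_i & c_i \end{bmatrix} \begin{bmatrix} 0 \cdots 0 & \sqrt{\lambda}\,\tilde{r}_{ii}(n-1) & \cdots & \sqrt{\lambda}\,\tilde{r}_{iM}(n-1) & \sqrt{\lambda}\,\tilde{k}_i(n-1) \\ 0 \cdots 0 & x_i^{(i)}(n) & \cdots & x_M^{(i)}(n) & y^{(i)}(n) \end{bmatrix} = \begin{bmatrix} 0 \cdots 0 & \tilde{r}_{ii}(n) & \tilde{r}_{i,i+1}(n) & \cdots & \tilde{r}_{iM}(n) & \tilde{k}_i(n) \\ 0 \cdots 0 & 0 & x_{i+1}^{(i+1)}(n) & \cdots & x_M^{(i+1)}(n) & y^{(i+1)}(n) \end{bmatrix} \tag{10.6.58}$$

where

$$c_i = \frac{\sqrt{\lambda}\,\tilde{r}_{ii}(n-1)}{\tilde{r}_{ii}(n)} \qquad s_i = \frac{x_i^{(i)}(n)}{\tilde{r}_{ii}(n)} \tag{10.6.59}$$

and

$$\tilde{r}_{ii}(n) = \left[\lambda\tilde{r}_{ii}^2(n-1) + |x_i^{(i)}(n)|^2\right]^{1/2} \tag{10.6.60}$$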
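The rotation parameters in (10.6.59) and (10.6.60) can be checked on a single pair of elements. The sketch below assumes real-valued data, so that the conjugate on the sine term disappears; the function name `givens_parameters` and the numerical values are hypothetical and chosen only for illustration.

```python
import math

def givens_parameters(r_old, x, lam):
    """Return (c, s, r_new) rotating the pair (sqrt(lam)*r_old, x) into (r_new, 0)."""
    r_new = math.sqrt(lam * r_old * r_old + x * x)
    if r_new == 0.0:
        # Nothing to rotate: use the identity rotation.
        return 1.0, 0.0, 0.0
    return math.sqrt(lam) * r_old / r_new, x / r_new, r_new

lam = 0.95
r_old, x = 2.0, 1.5                       # assumed diagonal element and new datum
c, s, r_new = givens_parameters(r_old, x, lam)

top = math.sqrt(lam) * r_old              # element in the ith row before rotation
rot_top = c * top + s * x                 # rotated ith-row element, should equal r_new
rot_bottom = -s * top + c * x             # rotated last-row element, should vanish
print(rot_top, r_new, rot_bottom)
```

The same pair `(c, s)` is then applied to the remaining element pairs of the two rows, which is what (10.6.58) expresses.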


Thus, if we perform (10.6.58) for $i = 1, 2, \ldots, M$, we annihilate the first $M$ elements in
the last row of $\bar{R}(n)$ and convert $\bar{R}(n)$ to the triangular matrix shown in (10.6.57). This
process requires a total of $M(M+1)/2$ Givens rotations. The orthogonalization matrix is

$$Q(n) = G^{(M)}(n) \cdots G^{(2)}(n)\, G^{(1)}(n) \tag{10.6.61}$$

where

$$G^{(i)}(n) = \begin{bmatrix} 1 & & & & \\ & \ddots & & & \\ & & c_i(n) & \cdots & s_i^*(n) \\ & & \vdots & \ddots & \vdots \\ & & -s_i(n) & \cdots & c_i(n) \end{bmatrix} \tag{10.6.62}$$

are $(M+1) \times (M+1)$ rotation matrices. Note that all off-diagonal elements, except those
in the $(i, M+1)$ and $(M+1, i)$ locations, are zero.
From (10.6.35) we can easily see that $\tilde{\alpha}(n)$ equals the last diagonal element of $Q(n)$.
Furthermore, taking into consideration the special structure of $G^{(i)}(n)$ and (10.6.61), we
obtain

$$\tilde{\alpha}(n) = \prod_{i=1}^{M} c_i(n) \tag{10.6.63}$$

that is, $\tilde{\alpha}(n)$ is the product of the cosine terms in the $M$ Givens rotations. This justifies the
interpretation of $\tilde{\alpha}(n)$ and $\alpha(n) = \tilde{\alpha}^2(n)$ as angle variables.

Although the LS solution is not defined if $n < M$, the RLS Givens algorithm may be
initialized by setting $\tilde{R}(-1) = 0$ and $\tilde{k}(-1) = 0$. The Givens rotation-based RLS algorithm is
summarized in Table 10.9. The algorithm requires about $2M^2$ multiplications, $2M$ divisions,
and $M$ square roots per time update.


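TABLE 10.9
The Givens rotation-based RLS
algorithm.

Initialization

    Set all elements $\tilde{r}_{ij}(-1) = 0$, $\tilde{k}_i(-1) = 0$

Time Recursion: $n = 0, 1, \ldots$

    $\tilde{e}(n) = y(n)$    $\tilde{\alpha}(n) = 1$
    For $i = 1$ to $M$ do
        $\tilde{r}_{ii}(n) = \{\lambda\tilde{r}_{ii}^2(n-1) + |x_i(n)|^2\}^{1/2}$
        $c = \sqrt{\lambda}\,\tilde{r}_{ii}(n-1) / \tilde{r}_{ii}(n)$    $s = x_i(n) / \tilde{r}_{ii}(n)$
        [If $\tilde{r}_{ii}(n) = 0$, set $c = 1$ and $s = 0$]
        For $j = i+1$ to $M$ do
            $\bar{x} = c\,x_j(n) - s\sqrt{\lambda}\,\tilde{r}_{ij}(n-1)$
            $\tilde{r}_{ij}(n) = c\sqrt{\lambda}\,\tilde{r}_{ij}(n-1) + s^* x_j(n)$
            $x_j(n) = \bar{x}$
        End
        $\bar{e} = c\,\tilde{e}(n) - s\sqrt{\lambda}\,\tilde{k}_i(n-1)$
        $\tilde{k}_i(n) = c\sqrt{\lambda}\,\tilde{k}_i(n-1) + s^*\tilde{e}(n)$
        $\tilde{e}(n) = \bar{e}$
        $\tilde{\alpha}(n) = c\,\tilde{\alpha}(n)$
    End
    $\varepsilon(n) = \tilde{e}(n)\tilde{\alpha}(n)$  or  $e(n) = \tilde{e}(n) / \tilde{\alpha}(n)$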
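The recursion of Table 10.9 can be sketched in a few lines of code. This is an illustrative sketch rather than a reference implementation: it assumes real-valued data (so conjugates are dropped), applies the factor $\sqrt{\lambda}$ to the stored elements as in (10.6.58), and recovers the coefficients by back substitution only to check the result. The names `qr_rls_update`, `back_substitute`, and `c_true`, as well as the test signal, are hypothetical.

```python
import math
import random

def qr_rls_update(R, k, x, y, lam):
    """One time update of Table 10.9 for real data; R and k are updated in place.

    Returns (a priori error, a posteriori error, alpha-tilde).
    """
    M = len(x)
    x = list(x)                 # working copy of the last row
    e_t = y                     # e-tilde: rotated desired response
    a_t = 1.0                   # alpha-tilde: running product of cosines
    sq = math.sqrt(lam)
    for i in range(M):
        r_new = math.sqrt(lam * R[i][i] ** 2 + x[i] ** 2)
        if r_new == 0.0:
            c, s = 1.0, 0.0
        else:
            c, s = sq * R[i][i] / r_new, x[i] / r_new
        R[i][i] = r_new
        for j in range(i + 1, M):
            x_bar = c * x[j] - s * sq * R[i][j]
            R[i][j] = c * sq * R[i][j] + s * x[j]
            x[j] = x_bar
        e_bar = c * e_t - s * sq * k[i]
        k[i] = c * sq * k[i] + s * e_t
        e_t = e_bar
        a_t *= c
    e_post = e_t * a_t
    e_prior = e_t / a_t if a_t != 0.0 else float("nan")
    return e_prior, e_post, a_t

def back_substitute(R, k):
    """Solve the upper triangular system R c = k."""
    M = len(k)
    c = [0.0] * M
    for i in reversed(range(M)):
        acc = k[i] - sum(R[i][j] * c[j] for j in range(i + 1, M))
        c[i] = acc / R[i][i]
    return c

random.seed(0)
M, lam = 3, 0.98
c_true = [0.5, -1.0, 2.0]              # hypothetical system to identify
R = [[0.0] * M for _ in range(M)]      # all r_ij(-1) = 0
k = [0.0] * M                          # all k_i(-1) = 0
for n in range(50):
    x = [random.gauss(0.0, 1.0) for _ in range(M)]
    y = sum(ci * xi for ci, xi in zip(c_true, x))   # noise-free desired response
    e_prior, e_post, alpha_t = qr_rls_update(R, k, x, y, lam)
c_hat = back_substitute(R, k)
print(c_hat, e_prior, alpha_t)
```

With noise-free data the recovered `c_hat` should agree with `c_true` to rounding error once at least $M$ data vectors have been processed. The a priori error is reported as NaN while $\tilde{\alpha}(n) = 0$, which happens during the first $M$ updates from the zero initialization.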


The algorithm in Table 10.9 may be implemented in parallel using a triangular array of
processors, as illustrated in Figure 10.36 for $M = 3$. At time $n-1$, the elements of $\tilde{R}(n-1)$
and $\tilde{k}(n-1)$ are stored in the array elements. The arriving new input data $[x^H(n)\ \ y^*(n)]$ are
fed from the top and propagate downward. The Givens rotation parameters are calculated
in the boundary cells and propagate from left to right. The internal cells receive the rotation
parameters from the left, perform the rotation on the data from the top, and pass results to the
cells at right and below. The angle variable $\tilde{\alpha}(n)$ is computed along the boundary cells and
the a priori or a posteriori error at the last cell. This updating procedure is repeated at each
time step upon the arrival of the new data. This structure was derived in McWhirter (1983) by
eliminating the linear part used to determine the coefficient vector, by back substitution, from
the systolic array introduced in Gentleman and Kung (1981) for the solution of general LS
problems. Clearly, the array in Figure 10.36 performs two distinct functions: It propagates
the matrix $\tilde{R}(n)$ and the vector $\tilde{k}(n)$ that define the LS array processor, and it performs,
although in a not-so-obvious way, the filtering operation by providing at the output the error
$\varepsilon(n)$ or $e(n)$. Figure 10.36 provides a functional description of the processing elements
only. In practice, there are different hardware and software implementations using systolic
arrays, wavefront arrays, and CORDIC processors. More detailed descriptions can be found
in McWhirter and Proudler (1993), Shepherd and McWhirter (1993), and Haykin (1996).


FIGURE 10.36
Systolic array implementation of the QR-RLS algorithm and
functional description of its processing elements.


10.6.7 Implementation of Inverse QR-RLS Algorithm Using the Givens Rotations 


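If we define the vector

$$\hat{w}(n) \triangleq \frac{1}{\sqrt{\lambda}}\tilde{R}^{-H}(n-1)x(n) \tag{10.6.64}$$

and the scalar

$$\hat{\alpha}^2(n) \triangleq 1 + \hat{w}^H(n)\hat{w}(n) = \frac{1}{\alpha(n)} \tag{10.6.65}$$

we can express (10.6.56) as

$$Q(n)\begin{bmatrix} \hat{w}(n) & \frac{1}{\sqrt{\lambda}}\tilde{R}^{-H}(n-1) \\ 1 & 0^H \end{bmatrix} = \begin{bmatrix} 0 & \tilde{R}^{-H}(n) \\ \hat{\alpha}(n) & \tilde{g}^H(n) \end{bmatrix} \tag{10.6.66}$$

where $\tilde{g}(n)$ is the normalized gain vector (10.6.49). The matrix $Q(n)$ will be chosen as a
sequence of Givens rotation matrices $G^{(i)}(n)$ defined in (10.6.62).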

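We first show that we can determine the angle parameters $c_i(n)$ and $s_i(n)$ of $G^{(i)}(n)$
using only the elements of $\hat{w}(n)$. To this end, we choose the angle parameters of the rotation
matrix $G^{(1)}(n)$ in

$$G^{(1)}(n)\begin{bmatrix} \hat{w}_1(n) \\ \hat{w}_2(n) \\ \vdots \\ \hat{w}_M(n) \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ \hat{w}_2(n) \\ \vdots \\ \hat{w}_M(n) \\ \hat{\alpha}_1(n) \end{bmatrix} \tag{10.6.67}$$

to annihilate the first element $\hat{w}_1(n)$. Note that owing to the structure of $G^{(1)}(n)$ the
remaining elements of $\hat{w}(n)$ are left unaffected. Since unitary transformations preserve the
Euclidean norm of a vector, we can easily see that

$$\hat{\alpha}_1^2(n) = 1 + |\hat{w}_1(n)|^2$$

which expresses $\hat{\alpha}_1(n)$ in terms of $\hat{w}_1(n)$. From the first and last equations in (10.6.67), we
have the system

$$c_1(n)\hat{w}_1(n) + s_1^*(n) = 0$$
$$-s_1(n)\hat{w}_1(n) + c_1(n) = \hat{\alpha}_1(n)$$

whose solution

$$c_1(n) = \frac{1}{\hat{\alpha}_1(n)} \qquad s_1(n) = -\frac{\hat{w}_1^*(n)}{\hat{\alpha}_1(n)}$$

provides the required parameters. Similarly, we can determine the rotation $G^{(2)}(n)$ to
annihilate the element $\hat{w}_2(n)$ of the vector on the right-hand side of (10.6.67). The required
rotation parameters are

$$c_2(n) = \frac{\hat{\alpha}_1(n)}{\hat{\alpha}_2(n)} \qquad s_2(n) = -\frac{\hat{w}_2^*(n)}{\hat{\alpha}_2(n)}$$

where

$$\hat{\alpha}_2^2(n) = 1 + |\hat{w}_1(n)|^2 + |\hat{w}_2(n)|^2 = \hat{\alpha}_1^2(n) + |\hat{w}_2(n)|^2$$

provides a recursive formula for the computation of $\hat{\alpha}_i(n)$. The remaining elements of $\hat{w}(n)$
can be annihilated by continuing in a similar way. In general, for $i = 1, 2, \ldots, M$, we have

$$\hat{\alpha}_i(n) = \left[\hat{\alpha}_{i-1}^2(n) + |\hat{w}_i(n)|^2\right]^{1/2} \qquad \hat{\alpha}_0(n) = 1 \tag{10.6.68}$$

$$c_i(n) = \frac{\hat{\alpha}_{i-1}(n)}{\hat{\alpha}_i(n)} \qquad s_i(n) = -\frac{\hat{w}_i^*(n)}{\hat{\alpha}_i(n)} \tag{10.6.69}$$

and $\hat{\alpha}(n) = \hat{\alpha}_M(n)$.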
Let us denote by $p_{ij}(n)$ the elements of matrix $\tilde{R}^{-H}(n)$ and by $g_j^{(i)}(n)$ the elements of
the vector $\tilde{g}^H(n)$ after the $i$th rotation. The first rotation updates the first element of the matrix
$\tilde{R}^{-H}(n-1)/\sqrt{\lambda}$ and modifies the first element of $\tilde{g}^H(n)$. Indeed, from

$$G^{(1)}(n)\begin{bmatrix} \frac{1}{\sqrt{\lambda}}\tilde{R}^{-H}(n-1) \\ 0^H \end{bmatrix} = \begin{bmatrix} p_{11}(n) & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ g_1^{(1)}(n) & 0 & \cdots & 0 \end{bmatrix} \tag{10.6.70}$$

we obtain

$$p_{11}(n) = \frac{1}{\sqrt{\lambda}}c_1(n)p_{11}(n-1) \qquad g_1^{(1)}(n) = -\frac{1}{\sqrt{\lambda}}s_1(n)p_{11}(n-1)$$

Multiplication of (10.6.70) by $G^{(2)}(n)$ updates the second row of $\tilde{R}^{-H}(n-1)/\sqrt{\lambda}$ and
modifies the first two elements of $\tilde{g}^H(n)$. In general, the $i$th rotation updates the $i$th row of
$\tilde{R}^{-H}(n-1)/\sqrt{\lambda}$ and modifies the first $i$ elements of $\tilde{g}^H(n)$ using the formulas

$$p_{ij}(n) = \frac{1}{\sqrt{\lambda}}c_i(n)p_{ij}(n-1) + s_i^*(n)g_j^{(i-1)}(n) \tag{10.6.71}$$
$$g_j^{(i)}(n) = c_i(n)g_j^{(i-1)}(n) - \frac{1}{\sqrt{\lambda}}s_i(n)p_{ij}(n-1) \tag{10.6.72}$$

for $1 \le i \le M$ and $1 \le j \le i$. These recursions are initialized with $g_j^{(i)}(n) = 0$, $1 \le j \le M$,
$i < j$, and provide the required quantities after $M$ rotations. The complete inverse QR-RLS
algorithm is summarized in Table 10.10, whereas a systolic array implementation is
discussed in Alexander and Ghirnikar (1993).


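TABLE 10.10
Summary of the inverse QR-RLS Givens
algorithm.†

Initialization

    $c(-1) = x(-1) = 0$    $p_{ii}(-1) = \delta > 1$

Time Recursion: $n = 0, 1, \ldots$

    $e(n) = y(n) - c^H(n-1)x(n)$
    $g_j^{(i)}(n) = 0$,  $1 \le j \le M$,  $i < j$
    $\hat{\alpha}_0(n) = 1$
    For $i = 1$ to $M$ do
        $\hat{w}_i(n) = \frac{1}{\sqrt{\lambda}}\sum_{j=1}^{i} p_{ij}(n-1)x_j(n)$
        $\hat{\alpha}_i(n) = [\hat{\alpha}_{i-1}^2(n) + |\hat{w}_i(n)|^2]^{1/2}$
        $c_i(n) = \hat{\alpha}_{i-1}(n) / \hat{\alpha}_i(n)$    $s_i(n) = -\hat{w}_i^*(n) / \hat{\alpha}_i(n)$
        For $j = 1$ to $i$ do
            $p_{ij}(n) = \frac{1}{\sqrt{\lambda}}c_i(n)p_{ij}(n-1) + s_i^*(n)g_j^{(i-1)}(n)$
            $g_j^{(i)}(n) = c_i(n)g_j^{(i-1)}(n) - \frac{1}{\sqrt{\lambda}}s_i(n)p_{ij}(n-1)$
        End
    End
    $g(n) = \tilde{g}(n) / \hat{\alpha}_M(n)$
    For $m = 1$ to $M$ do
        $c_m(n) = c_m(n-1) + g_m(n)e^*(n)$
    End

†The computations can be done "in-place" using temporary
variables as shown in Table 10.9.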
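A compact sketch of the recursion in Table 10.10 is given below. It assumes real-valued data, stores the lower triangular elements $p_{ij}$ in a list of rows, and uses a large initial diagonal value `delta`; the function name `inverse_qr_rls_update`, the value of `delta`, and the noise-free test system `c_true` are hypothetical choices for illustration only.

```python
import math
import random

def inverse_qr_rls_update(P, c, x, y, lam):
    """One time update of Table 10.10 for real data; P and c are updated in place."""
    M = len(x)
    sq = math.sqrt(lam)
    e = y - sum(ci * xi for ci, xi in zip(c, x))   # a priori error e(n)
    g = [0.0] * M                                  # g_j^(0)(n) = 0
    a_prev = 1.0                                   # alpha-hat_0(n) = 1
    for i in range(M):
        # w-hat_i(n) uses only the first i+1 elements (P is lower triangular).
        w = sum(P[i][j] * x[j] for j in range(i + 1)) / sq
        a_new = math.sqrt(a_prev * a_prev + w * w)
        cos_i, sin_i = a_prev / a_new, -w / a_new
        for j in range(i + 1):
            p_old = P[i][j]
            P[i][j] = cos_i * p_old / sq + sin_i * g[j]    # (10.6.71)
            g[j] = cos_i * g[j] - sin_i * p_old / sq       # (10.6.72)
        a_prev = a_new
    for m in range(M):
        c[m] += (g[m] / a_prev) * e                # gain g(n) = g-tilde(n) / alpha-hat_M(n)
    return e

random.seed(0)
M, lam, delta = 3, 0.98, 100.0
c_true = [0.5, -1.0, 2.0]                          # hypothetical system to identify
P = [[delta if i == j else 0.0 for j in range(M)] for i in range(M)]
c = [0.0] * M
for n in range(200):
    x = [random.gauss(0.0, 1.0) for _ in range(M)]
    y = sum(ci * xi for ci, xi in zip(c_true, x))  # noise-free desired response
    e = inverse_qr_rls_update(P, c, x, y, lam)
print(c, e)
```

Unlike the QR-RLS sketch, no back substitution is needed here, because the gain vector is produced directly by the rotations. The small bias caused by the finite initial value decays as more data are processed.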


10.6.8 Classification of RLS Algorithms for Array Processing 


Whereas the CRLS algorithm provides the basis for the introduction and performance
evaluation of exact LS adaptive filters for array processing, the Givens rotation-based
QR-RLS algorithms provide the most desirable implementation in terms of numerical behavior
and ease of hardware implementation. However, there are many more algorithms that have
interesting theoretical interpretations or may better serve the needs of particular applications.
In general, we have the following types of RLS algorithms.
1. The CRLS algorithm, which is a fixed-order algorithm, updates the inverse $P(n) =
   \hat{R}^{-1}(n)$ of the correlation matrix and then computes the gain vector through a
   matrix-by-vector multiplication (see Section 10.5).

2. Power-domain square root algorithms propagate either $\tilde{R}(n)$ or its inverse $\tilde{P}(n) \triangleq
   \tilde{R}^{-1}(n)$, using formulas derived from the Cholesky decomposition of $\hat{R}(n)$ or $P(n) =
   \hat{R}^{-1}(n)$, respectively. They include two types:

   a. Algorithms that propagate $\{\tilde{R}(n), \tilde{k}(n)\}$ (information filter approach) or $\{\tilde{R}^{-1}(n),
      \tilde{k}(n)\}$ (covariance filter approach†) and provide the a priori or a posteriori errors
      only.

   b. Algorithms that propagate $\tilde{R}(n)$ and compute $g(n)$ by solving (10.6.25) or propagate
      $\tilde{R}^{-1}(n)$ and compute $g(n)$ in a matrix-by-vector multiplication. Both algorithms
      compute the parameter vector $c(n)$ and the error $e(n)$ or $\varepsilon(n)$.

3. Amplitude-domain square root algorithms that propagate either $\tilde{R}(n)$ (QRD-based RLS)
   or its inverse $\tilde{P}(n) = \tilde{R}^{-1}(n)$ (inverse QRD-based RLS) working directly with the data
   matrix $\Lambda(n)[X(n)\ \ y(n)]$. In both cases, we can develop algorithms providing only the
   error $e(n)$ or $\varepsilon(n)$ or both the errors and the parameter vector $c(n)$.


Algorithms that propagate the Cholesky factor $\tilde{R}(n)$ avoid the loss-of-symmetry
problem and have better numerical properties because the condition number of $\tilde{R}(n)$ equals the
square root of the condition number of $\hat{R}(n)$. Because QRD-based algorithms have superior
numerical properties to their Cholesky counterparts, we have focused on RLS algorithms
based on the QRD of the data set $\Lambda(n)[X(n)\ \ y(n)]$. More specifically we discussed QRD-based
RLS algorithms using the Givens rotations. Other QRD-based RLS algorithms using
the MGS (Ling et al. 1986) and Householder transformations (Liu et al. 1992; Steinhardt
1988; Rader and Steinhardt 1986) also have been developed but are not as widely used.

It is generally accepted that QR decomposition leads to the best methods for solving
the LS problem (Golub and Van Loan 1996). It has been shown by simulation that the
Givens rotation-based QR-RLS algorithm is numerically stable for $\lambda < 1$ and diverges
for $\lambda = 1$ (Yang and Böhme 1992; Haykin 1996). This is the algorithm of choice for
applications that require only the a priori or a posteriori errors. Since the extended QR-RLS
algorithm propagates both $\tilde{R}(n)$ and $\tilde{R}^{-H}(n)$ independently from each other, in
finite-precision implementations, the computed values of $\tilde{R}(n)$ and $\tilde{R}^{-H}(n)$ deviate from each
other's Hermitian inverse. As a result of this numerical inconsistency, the algorithm becomes
numerically unstable (Haykin 1996). To avoid this problem, we can use either the QR-RLS
algorithm with back substitution "on the fly" or the inverse QR-RLS algorithm (Alexander
and Ghirnikar 1993; Pan and Plemmons 1989). The updating of $c(n)$ with this last algorithm
can be implemented in systolic array form without interrupting the adaptation process.

If we factor out the diagonal elements of matrix $\tilde{R}(n)$, obtained by QRD, we can express
$\tilde{R}(n)$ as

$$\tilde{R}(n) = \tilde{D}^{1/2}(n)\tilde{R}_1(n) \tag{10.6.73}$$

where $\tilde{R}_1(n)$ is an upper triangular matrix with unit diagonal elements, and

$$\tilde{D}(n) \triangleq \mathrm{diag}\{\tilde{r}_{11}^2(n), \tilde{r}_{22}^2(n), \ldots, \tilde{r}_{MM}^2(n)\} \tag{10.6.74}$$

is a diagonal matrix with positive elements. We can easily see that $\tilde{R}_1^H(n)$ and $\tilde{D}(n)$ provide
the factors of the LDU decomposition (10.6.14). It turns out that (10.6.73) provides the basis
for various QRD-based RLS algorithms that do not require square root operations. In similar
manner, the LDU decomposition makes possible the square root-free triangularization of
$\hat{R}(n)$ (see Section 6.3). All algorithms that use the Cholesky factor $\tilde{R}(n)$ or its inverse $\tilde{R}^{-1}(n)$
require square root operations, which we can avoid if we use the LDU decomposition factors
$\tilde{R}_1(n)$ and $\tilde{D}(n)$. Because such algorithms have inferior numerical properties to their square
root counterparts and are more prone to overflow and underflow problems, and because
square root operations are within the reach of current digital hardware, we concentrate on
RLS algorithms that propagate the Cholesky factor or its inverse (Stewart and Chapman
1990). However, square root-free algorithms are very useful for VLSI implementations.
The interested reader can find information about such algorithms in Bierman and Thornton
(1977), Ljung and Söderström (1983), and Hsieh et al. (1993).

†The terms information and covariance filtering-type algorithms are used in the context of Kalman filter theory
(Bierman 1977; Kailath 1981).

A unified derivation of the various RLS algorithms using a state-space formulation and 
their correspondence with related Kalman filtering algorithms is given in Sayed and Kailath 
(1994, 1998) and in Haykin (1996). 

All algorithms mentioned above hold for arbitrary input data vectors and require
$O(M^2)$ arithmetic operations per time update. However, if the input data vector has a
shift-invariant structure, all algorithms lead to simplified versions that require $O(M)$
arithmetic operations per time update. These algorithms, which can be used for LS FIR filtering
and prediction applications, are discussed in the following section.


10.7 FAST RLS ALGORITHMS FOR FIR FILTERING 


In Section 7.3 we exploited the shift invariance of the input data vector

$$x_{m+1}(n) = \begin{bmatrix} x_m(n) \\ x(n-m) \end{bmatrix} = \begin{bmatrix} x(n) \\ x_m(n-1) \end{bmatrix} \tag{10.7.1}$$

to develop a lattice-ladder structure for optimum FIR filters and predictors. The
determination of the optimum parameters (see Figure 7.3) required the $LDL^H$ decomposition of
the correlation matrix $R(n)$ and the solution of three triangular systems at each time $n$.
However, for stationary signals the optimum filter is time-invariant, and the coefficients
of its direct or lattice-ladder implementation structure are evaluated only once, using the
algorithm of Levinson.

The key for the development of order-recursive algorithms was the following order
partitioning of the correlation matrix

$$R_{m+1}(n) = \begin{bmatrix} R_m(n) & r_m^b(n) \\ r_m^{bH}(n) & P_x(n-m) \end{bmatrix} = \begin{bmatrix} P_x(n) & r_m^{fH}(n) \\ r_m^f(n) & R_m(n-1) \end{bmatrix} \tag{10.7.2}$$

which is a result of the shift-invariance property (10.7.1). The same partitioning can be
obtained for the LS correlation matrix $\hat{R}_m(n)$

$$\hat{R}_{m+1}(n) = \sum_{j=0}^{n} \lambda^{n-j}x_{m+1}(j)x_{m+1}^H(j) = \begin{bmatrix} \hat{R}_m(n) & \hat{r}_m^b(n) \\ \hat{r}_m^{bH}(n) & E_x(n-m) \end{bmatrix} = \begin{bmatrix} E_x(n) & \hat{r}_m^{fH}(n) \\ \hat{r}_m^f(n) & \hat{R}_m(n-1) \end{bmatrix} \tag{10.7.3}$$

if we assume that $x_m(-1) = 0$, a condition known as prewindowing (see Section 8.3). This
condition is necessary to ensure the presence of the term $\hat{R}_m(n-1)$ in the lower right corner
partitioning of $\hat{R}_{m+1}(n)$.

The identical forms of (10.7.2) and (10.7.3) imply that the order-recursive relations and
the lattice-ladder structure developed in Section 7.3 for optimum FIR filters can be used for
prewindowed LS FIR filters. Simply, the expectation operator $E\{(\cdot)\}$ should be replaced by
the time-averaging operator $\sum_{j=0}^{n}\lambda^{n-j}(\cdot)$, and the term power should be replaced by the
term energy, when we go from the optimum MSE to the LSE formulation.
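The two partitionings in (10.7.3) can be verified exactly on a short prewindowed record. The sketch below assumes a real-valued signal and uses rational arithmetic so that the comparison is exact; the sample values, the forgetting factor, and the helper names `data_vector` and `correlation` are hypothetical.

```python
from fractions import Fraction

def data_vector(x, n, m):
    """x_m(n) = [x(n), x(n-1), ..., x(n-m+1)], with x(j) = 0 for j < 0 (prewindowing)."""
    return [x[n - k] if n - k >= 0 else Fraction(0) for k in range(m)]

def correlation(x, n, m, lam):
    """Exponentially weighted LS correlation matrix of order m at time n."""
    R = [[Fraction(0)] * m for _ in range(m)]
    for j in range(n + 1):
        v = data_vector(x, j, m)
        w = lam ** (n - j)
        for a in range(m):
            for b in range(m):
                R[a][b] += w * v[a] * v[b]
    return R

x = [Fraction(v) for v in (3, -1, 4, 1, -5, 9, 2, -6)]   # assumed test signal
lam = Fraction(9, 10)
m, n = 3, 7

R_big = correlation(x, n, m + 1, lam)            # order m+1 matrix at time n
upper_left = [row[:m] for row in R_big[:m]]      # leading m x m block
lower_right = [row[1:] for row in R_big[1:]]     # trailing m x m block

upper_ok = upper_left == correlation(x, n, m, lam)        # equals R_m(n)
lower_ok = lower_right == correlation(x, n - 1, m, lam)   # equals R_m(n-1)
print(upper_ok, lower_ok)
```

The lower right block matches the order-$m$ matrix at time $n-1$ only because the term for $j = 0$ involves $x_m(-1) = 0$; without prewindowing that term would not vanish.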

In this section we exploit the shift invariance (10.7.1) and the time updating

$$\hat{R}_m(n) = \lambda\hat{R}_m(n-1) + x_m(n)x_m^H(n) \tag{10.7.4}$$

to develop the following types of fast algorithms with $O(M)$ complexity:

1. Fast fixed-order algorithms for RLS direct-form FIR filters by explicitly updating the
   gain vectors $g(n)$ and $\bar{g}(n)$.

2. Fast order-recursive algorithms for RLS FIR lattice-ladder filters by indirect or direct
   updating of their coefficients.

3. QR decomposition-based RLS lattice-ladder algorithms using the Givens rotation.

All relationships in Section 7.3 are valid for the prewindowed LS problem, but we replace $P$
by $E$ to emphasize the energy interpretation of the cost function. The quantities appearing
in the partitionings given by (10.7.3) specify a prewindowed LS forward linear predictor
$a_m(n)$ and an LS backward linear predictor $b_m(n)$. Table 10.11 shows the correspondences
between general FIR filtering, FLP, and BLP. Using these correspondences and the normal
equations for LS filtering, we can easily obtain the normal equations and the total LSE for
the FLP and the BLP, which are also summarized in Table 10.11 (see Problem 10.28). We
stress that the predictor parameters $a_m(n)$ and $b_m(n)$ are held fixed over the optimization
interval $0 \le j \le n$.


TABLE 10.11
Summary and correspondences between LS FIR filtering, forward linear prediction, and backward linear prediction.

 | FIR filter | FLP | BLP
Input data vector | $x_m(n)$ | $x_m(n-1)$ | $x_m(n)$
Desired response | $y(n)$ | $x(n)$ | $x(n-m)$
Coefficient vector | $c_m(n)$ | $-a_m(n)$ | $-b_m(n)$
Error | $\varepsilon_m(n) = y(n) - c_m^H(n)x_m(n)$ | $\varepsilon_m^f(n) = x(n) + a_m^H(n)x_m(n-1)$ | $\varepsilon_m^b(n) = x(n-m) + b_m^H(n)x_m(n)$
Cost function | $E_m(n) = \sum_{j=0}^{n}\lambda^{n-j}|\varepsilon_m(j)|^2$ | $E_m^f(n) = \sum_{j=0}^{n}\lambda^{n-j}|\varepsilon_m^f(j)|^2$ | $E_m^b(n) = \sum_{j=0}^{n}\lambda^{n-j}|\varepsilon_m^b(j)|^2$
Normal equations | $\hat{R}_m(n)c_m(n) = \hat{d}_m(n)$ | $\hat{R}_m(n-1)a_m(n) = -\hat{r}_m^f(n)$ | $\hat{R}_m(n)b_m(n) = -\hat{r}_m^b(n)$
LSE | $E_m(n) = E_y(n) - c_m^H(n)\hat{d}_m(n)$ | $E_m^f(n) = E_x(n) + a_m^H(n)\hat{r}_m^f(n)$ | $E_m^b(n) = E_x(n-m) + b_m^H(n)\hat{r}_m^b(n)$
Correlation matrix | $\hat{R}_m(n) = \sum_{j=0}^{n}\lambda^{n-j}x_m(j)x_m^H(j)$ | $\hat{R}_m(n-1)$ | $\hat{R}_m(n)$
Cross-correlation vectors | $\hat{d}_m(n) = \sum_{j=0}^{n}\lambda^{n-j}x_m(j)y^*(j)$ | $\hat{r}_m^f(n) = \sum_{j=0}^{n}\lambda^{n-j}x_m(j-1)x^*(j)$ | $\hat{r}_m^b(n) = \sum_{j=0}^{n}\lambda^{n-j}x_m(j)x^*(j-m)$


Table 10.12 summarizes the a priori and a posteriori time updates for the LS FIR filter 
derived in Section 10.5. If we use the correspondences between general FIR filtering and 
linear prediction, we can easily deduce similar time-updating recursions for the FLP and 
the BLP. These updates, which are also discussed in Problem 10.29, are summarized in 
Table 10.12. 


10.7.1 Fast Fixed-Order RLS FIR Filters 


The major computational task in RLS filters is the computation of the gain vector $g(n)$
or $\bar{g}(n)$. The CRLS algorithm updates the inverse matrix $\hat{R}^{-1}(n)$ and then determines the
gain vector via a matrix-by-vector multiplication that results in $O(M^2)$ complexity. The
only way to reduce the complexity from $O(M^2)$ to $O(M)$ is by directly updating the gain
vectors. We next show how to develop such algorithms by exploiting the shift-invariant
structure of the input data vector shown in (10.7.1).


TABLE 10.12
Summary of LS time-updating relations using a priori and a posteriori errors.

Equation | A priori time updating | A posteriori time updating
Gain (a) | $\hat{R}_m(n)g_m(n) = x_m(n)$ | $\lambda\hat{R}_m(n-1)\bar{g}_m(n) = x_m(n)$
Filter (b) | $e_m(n) = y(n) - c_m^H(n-1)x_m(n)$ | $\varepsilon_m(n) = y(n) - c_m^H(n)x_m(n)$
(c) | $c_m(n) = c_m(n-1) + g_m(n)e_m^*(n)$ | $c_m(n) = c_m(n-1) + \bar{g}_m(n)\varepsilon_m^*(n)$
(d) | $E_m(n) = \lambda E_m(n-1) + \alpha_m(n)|e_m(n)|^2$ | $E_m(n) = \lambda E_m(n-1) + |\varepsilon_m(n)|^2 / \alpha_m(n)$
FLP (e) | $e_m^f(n) = x(n) + a_m^H(n-1)x_m(n-1)$ | $\varepsilon_m^f(n) = x(n) + a_m^H(n)x_m(n-1)$
(f) | $a_m(n) = a_m(n-1) - g_m(n-1)e_m^{f*}(n)$ | $a_m(n) = a_m(n-1) - \bar{g}_m(n-1)\varepsilon_m^{f*}(n)$
(g) | $E_m^f(n) = \lambda E_m^f(n-1) + \alpha_m(n-1)|e_m^f(n)|^2$ | $E_m^f(n) = \lambda E_m^f(n-1) + |\varepsilon_m^f(n)|^2 / \alpha_m(n-1)$
BLP (h) | $e_m^b(n) = x(n-m) + b_m^H(n-1)x_m(n)$ | $\varepsilon_m^b(n) = x(n-m) + b_m^H(n)x_m(n)$
(i) | $b_m(n) = b_m(n-1) - g_m(n)e_m^{b*}(n)$ | $b_m(n) = b_m(n-1) - \bar{g}_m(n)\varepsilon_m^{b*}(n)$
(j) | $E_m^b(n) = \lambda E_m^b(n-1) + \alpha_m(n)|e_m^b(n)|^2$ | $E_m^b(n) = \lambda E_m^b(n-1) + |\varepsilon_m^b(n)|^2 / \alpha_m(n)$


Fast Kalman algorithm: Updating the gain $g(n)$
Suppose that we know the gain

$$g_m(n-1) = \hat{R}_m^{-1}(n-1)x_m(n-1) \tag{10.7.5}$$

and we wish to compute the gain

$$g_m(n) = \hat{R}_m^{-1}(n)x_m(n) \tag{10.7.6}$$

at the next time instant by "adjusting" $g_m(n-1)$, using the new data $\{x(n), y(n)\}$.
If we use the matrix inversion by partitioning formulas (7.1.24) and (7.1.26) for matrix
$\hat{R}_{m+1}(n)$, we have

$$\hat{R}_{m+1}^{-1}(n) = \begin{bmatrix} \hat{R}_m^{-1}(n) & 0_m \\ 0_m^H & 0 \end{bmatrix} + \frac{1}{E_m^b(n)}\begin{bmatrix} b_m(n) \\ 1 \end{bmatrix}\begin{bmatrix} b_m^H(n) & 1 \end{bmatrix} \tag{10.7.7}$$

and

$$\hat{R}_{m+1}^{-1}(n) = \begin{bmatrix} 0 & 0_m^H \\ 0_m & \hat{R}_m^{-1}(n-1) \end{bmatrix} + \frac{1}{E_m^f(n)}\begin{bmatrix} 1 \\ a_m(n) \end{bmatrix}\begin{bmatrix} 1 & a_m^H(n) \end{bmatrix} \tag{10.7.8}$$

as was shown in Section 7.1.
Using (10.7.7), the first partitioning in (10.7.1), and the definition of $\varepsilon_m^b(n)$ from Table
10.12, we obtain

$$g_{m+1}(n) = \begin{bmatrix} g_m(n) \\ 0 \end{bmatrix} + \frac{\varepsilon_m^b(n)}{E_m^b(n)}\begin{bmatrix} b_m(n) \\ 1 \end{bmatrix} \tag{10.7.9}$$

which provides a pure order update of the gain vector $g_m(n)$. Similarly, using (10.7.8), the
second partitioning in (10.7.1), and the definition of $\varepsilon_m^f(n)$ from Table 10.12, we have

$$g_{m+1}(n) = \begin{bmatrix} 0 \\ g_m(n-1) \end{bmatrix} + \frac{\varepsilon_m^f(n)}{E_m^f(n)}\begin{bmatrix} 1 \\ a_m(n) \end{bmatrix} \tag{10.7.10}$$

which provides a combined order and time update of the gain vector $g_m(n)$. This is the key
to the development of fast algorithms for updating the gain vector.

Given the gain $g_m(n-1)$, first we compute $g_{m+1}(n)$, using (10.7.10). Then we compute
$g_m(n)$ from the first $m$ equations of (10.7.9) as

$$g_m(n) = g_{m+1}^{\lceil m \rceil}(n) - g_{m+1}^{(m+1)}(n)\,b_m(n) \tag{10.7.11}$$

because

$$g_{m+1}^{(m+1)}(n) = \frac{\varepsilon_m^b(n)}{E_m^b(n)} \tag{10.7.12}$$

from the last equation in (10.7.9). Here $g_{m+1}^{\lceil m \rceil}(n)$ denotes the first $m$ elements of $g_{m+1}(n)$ and
$g_{m+1}^{(m+1)}(n)$ its last element. The updatings (10.7.9) and (10.7.10) require time
updatings for the predictors $a_m(n)$ and $b_m(n)$ and the minimum error energies $E_m^f(n)$ and $E_m^b(n)$,
which are given in Table 10.12. The only remaining problem is the coupling between $g_m(n)$
in (10.7.11) and $b_m(n)$ in

$$b_m(n) = b_m(n-1) - g_m(n)e_m^{b*}(n) \tag{10.7.13}$$

which can be avoided by eliminating $b_m(n)$. Carrying out the elimination, we obtain

$$g_m(n) = \frac{g_{m+1}^{\lceil m \rceil}(n) - g_{m+1}^{(m+1)}(n)\,b_m(n-1)}{1 - g_{m+1}^{(m+1)}(n)\,e_m^{b*}(n)} \tag{10.7.14}$$

which provides the last step required to complete the updating. This approach, which is
known as the fast Kalman algorithm, was developed in Falconer and Ljung (1978) using
the ideas introduced by Morf (1974). To emphasize the fixed-order nature of the algorithm,
we set $m = M$ and drop the order subscript for all quantities of order $M$. The
computational organization of the algorithm, which requires $9M$ operations per time updating, is
summarized in Table 10.13.

TABLE 10.13
Fast Kalman algorithm for time updating of LS FIR filters.

Equation | Computation

Old estimates: $a(n-1)$, $b(n-1)$, $g(n-1)$, $c(n-1)$, $E^f(n-1)$
New data: $\{x(n), y(n)\}$

Gain and predictor update

(a) | $e^f(n) = x(n) + a^H(n-1)x(n-1)$
(b) | $a(n) = a(n-1) - g(n-1)e^{f*}(n)$
(c) | $\varepsilon^f(n) = x(n) + a^H(n)x(n-1)$
(d) | $E^f(n) = \lambda E^f(n-1) + \varepsilon^f(n)e^{f*}(n)$
(e) | $g_{M+1}(n) = \begin{bmatrix} 0 \\ g(n-1) \end{bmatrix} + \frac{\varepsilon^f(n)}{E^f(n)}\begin{bmatrix} 1 \\ a(n) \end{bmatrix}$
(f) | $e^b(n) = x(n-M) + b^H(n-1)x(n)$
(g) | $g(n) = \dfrac{g_{M+1}^{\lceil M \rceil}(n) - g_{M+1}^{(M+1)}(n)\,b(n-1)}{1 - g_{M+1}^{(M+1)}(n)\,e^{b*}(n)}$
(h) | $b(n) = b(n-1) - g(n)e^{b*}(n)$

Filter update

(i) | $e(n) = y(n) - c^H(n-1)x(n)$
(j) | $c(n) = c(n-1) + g(n)e^*(n)$
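Steps (a) through (j) of Table 10.13 translate almost line by line into code. The sketch below assumes real-valued signals and a soft initialization with zero predictors, zero gain, and a small positive forward error energy `E_f`; that initialization, the dictionary-based `state`, and the test system `c_true` are assumptions made for this illustration and are not prescribed by the table.

```python
import random

def dot(u, v):
    """Inner product of two real vectors."""
    return sum(p * q for p, q in zip(u, v))

def fast_kalman_update(state, x_new, y, lam):
    """One pass through steps (a)-(j) of Table 10.13 for real data."""
    a, b, g, c, xv = state["a"], state["b"], state["g"], state["c"], state["x"]
    e_f = x_new + dot(a, xv)                              # (a), xv holds x(n-1)
    a = [ai - gi * e_f for ai, gi in zip(a, g)]           # (b)
    eps_f = x_new + dot(a, xv)                            # (c)
    E_f = lam * state["E_f"] + eps_f * e_f                # (d)
    k = eps_f / E_f
    g_ext = [k] + [gi + k * ai for gi, ai in zip(g, a)]   # (e), length M + 1
    xv_new = [x_new] + xv[:-1]                            # data vector x(n)
    e_b = xv[-1] + dot(b, xv_new)                         # (f), xv[-1] = x(n-M)
    g_last = g_ext[-1]
    denom = 1.0 - g_last * e_b
    g = [(gi - g_last * bi) / denom
         for gi, bi in zip(g_ext[:-1], b)]                # (g)
    b = [bi - gi * e_b for bi, gi in zip(b, g)]           # (h)
    e = y - dot(c, xv_new)                                # (i)
    c = [ci + gi * e for ci, gi in zip(c, g)]             # (j)
    state.update(a=a, b=b, g=g, c=c, x=xv_new, E_f=E_f)
    return e

random.seed(1)
M, lam = 3, 0.99
c_true = [0.8, -0.4, 0.2]                 # hypothetical FIR system to identify
state = {"a": [0.0] * M, "b": [0.0] * M, "g": [0.0] * M, "c": [0.0] * M,
         "x": [0.0] * M, "E_f": 0.01}     # assumed soft initialization
taps = [0.0] * M                          # tap-delay line used to form y(n)
for n in range(300):
    x_new = random.gauss(0.0, 1.0)
    taps = [x_new] + taps[:-1]
    y = dot(c_true, taps)                 # noise-free desired response
    e = fast_kalman_update(state, x_new, y, lam)
print(state["c"], e)
```

Each update touches only vectors of length $M$ or $M+1$, which is the source of the $O(M)$ cost. The sketch is intended for short runs only; it makes no attempt to address the long-term numerical behavior of fast fixed-order algorithms.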


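The FAEST algorithm: Updating the gain $\bar{\mathbf{g}}(n)$
In a similar way we can update the gain vector

$$\bar{\mathbf{g}}_m(n) = \lambda^{-1}\mathbf{R}_m^{-1}(n-1)\,\mathbf{x}_m(n) \tag{10.7.15}$$

by using (10.7.9) and (10.7.10). Indeed, using (10.7.10) with the lower partitioning (10.7.1)
and (10.7.9) with the upper partitioning (10.7.1), we obtain

$$\bar{\mathbf{g}}_{m+1}(n) = \begin{bmatrix} 0 \\ \bar{\mathbf{g}}_m(n-1) \end{bmatrix} + \frac{e_m^f(n)}{\lambda E_m^f(n-1)} \begin{bmatrix} 1 \\ \mathbf{a}_m(n-1) \end{bmatrix} \tag{10.7.16}$$

and

$$\bar{\mathbf{g}}_{m+1}(n) = \begin{bmatrix} \bar{\mathbf{g}}_m(n) \\ 0 \end{bmatrix} + \frac{e_m^b(n)}{\lambda E_m^b(n-1)} \begin{bmatrix} \mathbf{b}_m(n-1) \\ 1 \end{bmatrix} \tag{10.7.17}$$

which provide a link between $\bar{\mathbf{g}}_m(n-1)$ and $\bar{\mathbf{g}}_m(n)$. From (10.7.17) we obtain

$$\bar{\mathbf{g}}_m(n) = \bar{\mathbf{g}}_{m+1}^{\lceil m \rceil}(n) - \bar{g}_{m+1}^{(m+1)}(n)\, \mathbf{b}_m(n-1) \tag{10.7.18}$$

because

$$\bar{g}_{m+1}^{(m+1)}(n) = \frac{e_m^b(n)}{\lambda E_m^b(n-1)} \tag{10.7.19}$$

from the last row of (10.7.17). The fundamental difference between (10.7.9) and (10.7.17)
is that the presence of $\mathbf{b}_m(n-1)$ in the latter breaks the coupling between gain vector and
backward predictor. Furthermore, (10.7.19) can be used to compute $e_m^b(n)$ by

$$e_m^b(n) = \lambda E_m^b(n-1)\, \bar{g}_{m+1}^{(m+1)}(n) \tag{10.7.20}$$

with only two multiplications.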

The time updatings of the predictors using the gain $\bar{\mathbf{g}}_m(n)$, which are given in Table
10.12, require the a posteriori errors that can be computed from the a priori errors by using
the conversion factor

$$\alpha_m(n) = 1 + \bar{\mathbf{g}}_m^H(n)\,\mathbf{x}_m(n) \tag{10.7.21}$$

which should be updated in time as well. This can be achieved by a two-step procedure as
follows. First, using (10.7.16) and the lower partitioning (10.7.1), we obtain

$$\alpha_{m+1}(n) = \alpha_m(n-1) + \frac{|e_m^f(n)|^2}{\lambda E_m^f(n-1)} \tag{10.7.22}$$

which is a combined time and order updating. Then we use (10.7.17) and the upper
partitioning (10.7.1) to obtain

$$\alpha_m(n) = \alpha_{m+1}(n) - \bar{g}_{m+1}^{(m+1)}(n)\, e_m^{b*}(n) \tag{10.7.23}$$

or

$$\alpha_m(n) = \alpha_{m+1}(n) - \frac{|e_m^b(n)|^2}{\lambda E_m^b(n-1)} \tag{10.7.24}$$

which in conjunction with (10.7.22) provides the required time update
$\alpha_m(n-1) \rightarrow \alpha_{m+1}(n) \rightarrow \alpha_m(n)$.

This leads to the fast a posteriori error sequential technique (FAEST) algorithm
presented in Table 10.14, which was introduced in Carayannis et al. (1983). The FAEST
algorithm requires only 7M operations per time update and is the most efficient known
algorithm for prewindowed RLS FIR filters.
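The steps of Table 10.14 can be sketched in a few lines of code. The following is a minimal, real-valued illustration only: the function names `faest`, `conventional_rls`, and `demo_data` are hypothetical, and the comparison with conventional RLS assumes $\lambda = 1$, for which the initialization $E^f(-1) = E^b(-1) = \delta$ is expected to coincide with starting conventional RLS from $\mathbf{P}(-1) = \delta^{-1}\mathbf{I}$.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def faest(x, y, M, lam=1.0, delta=0.01):
    """Real-valued sketch of the FAEST recursions; returns the final c(n)."""
    a = [0.0] * M          # forward predictor a(n-1)
    b = [0.0] * M          # backward predictor b(n-1)
    c = [0.0] * M          # filter c(n-1)
    gbar = [0.0] * M       # alternative gain gbar(n-1)
    Ef = Eb = delta        # minimum error energies at n = -1
    alpha = 1.0            # conversion factor alpha(n-1)
    xv = [0.0] * M         # data vector x(n-1), prewindowed with zeros
    for xn, yn in zip(x, y):
        ef = xn + dot(a, xv)                       # a priori forward error
        epsf = ef / alpha                          # a posteriori forward error
        kf = ef / (lam * Ef)
        # order-(M+1) gain built from gbar(n-1) and the OLD predictor a(n-1)
        gext = [kf] + [g + kf * ai for g, ai in zip(gbar, a)]
        a = [ai - g * epsf for ai, g in zip(a, gbar)]
        alpha_ext = alpha + ef * kf                # alpha_{M+1}(n)
        Ef = lam * Ef + epsf * ef
        last = gext[M]
        eb = lam * Eb * last                       # a priori backward error
        gbar = [gext[i] - last * b[i] for i in range(M)]
        alpha = alpha_ext - last * eb              # alpha(n)
        epsb = eb / alpha                          # a posteriori backward error
        b = [bi - g * epsb for bi, g in zip(b, gbar)]
        Eb = lam * Eb + epsb * eb
        xv = [xn] + xv[:-1]                        # now holds x(n)
        eps = (yn - dot(c, xv)) / alpha            # a posteriori filtering error
        c = [ci + g * eps for ci, g in zip(c, gbar)]
    return c

def conventional_rls(x, y, M, lam=1.0, delta=0.01):
    """Reference O(M^2) RLS started from P(-1) = I/delta."""
    c = [0.0] * M
    P = [[(1.0 / delta if i == j else 0.0) for j in range(M)] for i in range(M)]
    xv = [0.0] * M
    for xn, yn in zip(x, y):
        xv = [xn] + xv[:-1]
        gbar = [dot(row, xv) / lam for row in P]
        alpha = 1.0 + dot(gbar, xv)
        g = [gi / alpha for gi in gbar]
        e = yn - dot(c, xv)
        c = [ci + gi * e for ci, gi in zip(c, g)]
        P = [[P[i][j] / lam - g[i] * gbar[j] for j in range(M)] for i in range(M)]
    return c

def demo_data(N=60, seed=0):
    """Hypothetical test signal: white input through a 3-tap FIR system."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(N)]
    h = [0.5, -0.3, 0.2]
    y = [sum(h[k] * x[n - k] for k in range(3) if n - k >= 0) for n in range(N)]
    return x, y
```

The sketch evaluates the order-(M+1) gain before overwriting the forward predictor, because that step uses $\mathbf{a}(n-1)$ rather than $\mathbf{a}(n)$. For $\lambda < 1$ and long data records, the numerical issues discussed later in this section apply.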


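Fast transversal filter (FTF) algorithm. This is an a posteriori type of algorithm
obtained from the FAEST by using the conversion factor

$$\bar{\alpha}_m(n) = 1 - \mathbf{g}_m^H(n)\,\mathbf{x}_m(n) \tag{10.7.25}$$

instead of the conversion factor $\alpha_m(n) = 1/\bar{\alpha}_m(n)$. Using the Levinson recursions (10.7.9)
and (10.7.10) in conjunction with the upper and lower partitionings in (10.7.1), we obtain

$$\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n) - \frac{|\varepsilon_m^b(n)|^2}{E_m^b(n)} \tag{10.7.26}$$

and

$$\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n-1) - \frac{|\varepsilon_m^f(n)|^2}{E_m^f(n)} \tag{10.7.27}$$

respectively. To obtain the FTF algorithm, we replace $\alpha_m(n)$ in Table 10.14 by $1/\bar{\alpha}_m(n)$ and
Equation (h) by (10.7.27). To obtain $\bar{\alpha}_m(n)$ from $\bar{\alpha}_{m+1}(n)$, we cannot use (10.7.26) because

TABLE 10.14
FAEST algorithm for time updating of LS FIR filters.

Equation   Computation

Old estimates: $\mathbf{a}(n-1)$, $\mathbf{b}(n-1)$, $\mathbf{c}(n-1)$, $\bar{\mathbf{g}}(n-1)$, $E^f(n-1)$, $E^b(n-1)$, $\alpha(n-1)$
New data: $\{\mathbf{x}(n), y(n)\}$

Gain and predictor update

(a)  $e^f(n) = x(n) + \mathbf{a}^H(n-1)\mathbf{x}(n-1)$
(b)  $\varepsilon^f(n) = \dfrac{e^f(n)}{\alpha(n-1)}$
(c)  $\mathbf{a}(n) = \mathbf{a}(n-1) - \bar{\mathbf{g}}(n-1)\, \varepsilon^{f*}(n)$
(d)  $E^f(n) = \lambda E^f(n-1) + \varepsilon^f(n)\, e^{f*}(n)$
(e)  $\bar{\mathbf{g}}_{M+1}(n) = \begin{bmatrix} 0 \\ \bar{\mathbf{g}}(n-1) \end{bmatrix} + \dfrac{e^f(n)}{\lambda E^f(n-1)} \begin{bmatrix} 1 \\ \mathbf{a}(n-1) \end{bmatrix}$
(f)  $e^b(n) = \lambda E^b(n-1)\, \bar{g}_{M+1}^{(M+1)}(n)$
(g)  $\bar{\mathbf{g}}(n) = \bar{\mathbf{g}}_{M+1}^{\lceil M \rceil}(n) - \bar{g}_{M+1}^{(M+1)}(n)\, \mathbf{b}(n-1)$
(h)  $\alpha_{M+1}(n) = \alpha(n-1) + \dfrac{|e^f(n)|^2}{\lambda E^f(n-1)}$
(i)  $\alpha(n) = \alpha_{M+1}(n) - \bar{g}_{M+1}^{(M+1)*}(n)\, e^b(n)$
(j)  $\mathbf{b}(n) = \mathbf{b}(n-1) - \bar{\mathbf{g}}(n)\, \varepsilon^{b*}(n)$
(k)  $\varepsilon^b(n) = \dfrac{e^b(n)}{\alpha(n)}$
(l)  $E^b(n) = \lambda E^b(n-1) + \varepsilon^b(n)\, e^{b*}(n)$

Filter update

(m)  $e(n) = y(n) - \mathbf{c}^H(n-1)\mathbf{x}(n)$
(n)  $\varepsilon(n) = \dfrac{e(n)}{\alpha(n)}$
(o)  $\mathbf{c}(n) = \mathbf{c}(n-1) + \bar{\mathbf{g}}(n)\, \varepsilon^*(n)$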


it requires quantities dependent on $\bar{\alpha}_m(n)$. To avoid this problem, we replace Equation (i)
by the following relation

$$\bar{\alpha}_m(n) = \frac{\bar{\alpha}_{m+1}(n)}{1 - \bar{\alpha}_{m+1}(n)\, \bar{g}_{m+1}^{(m+1)}(n)\, e_m^{b*}(n)} \tag{10.7.28}$$

obtained by combining (10.7.24), (10.7.19), and $\bar{\alpha}_m(n) = 1/\alpha_m(n)$. This algorithm, which
has the same complexity as FAEST, was introduced in Cioffi and Kailath (1984) using a
geometric derivation, and is known as the fast transversal filter (FTF) algorithm.

An alternative updating to (10.7.27) can be obtained by noticing that

$$\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n-1) - \frac{|\varepsilon_m^f(n)|^2}{E_m^f(n)} = \frac{\bar{\alpha}_m(n-1)}{E_m^f(n)} \left[ E_m^f(n) - \bar{\alpha}_m(n-1)\, |e_m^f(n)|^2 \right]$$

or equivalently

$$\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n-1)\, \frac{\lambda E_m^f(n-1)}{E_m^f(n)} \tag{10.7.29}$$

which can be used instead of (10.7.27) in the FTF algorithm. In a similar way, we can show
that

$$\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n)\, \frac{\lambda E_m^b(n-1)}{E_m^b(n)} \tag{10.7.30}$$

which will be used later.


Some practical considerations 


Figure 10.37 shows the realization of an adaptive RLS filter using the direct-form 
structure. The coefficient updating can be done using any of the introduced fast RLS algo- 
rithms. Some issues related to the implementation of these filters using multiprocessing are 
discussed in Problem 10.48. 


[Figure 10.37: block diagram with the signals $x(n)$, $y(n)$, and $e(n)$ and a coefficient-updating block.]

FIGURE 10.37
Implementation of an adaptive FIR filter using a direct-form structure.


In practice, the fast direct-form RLS algorithms are initialized at n = 0 by setting

$$E^f(-1) = E^b(-1) = \delta > 0, \qquad \alpha(-1) = 1 \quad \text{or} \quad \bar{\alpha}(-1) = 1 \tag{10.7.31}$$

and all other quantities equal to zero. The constant $\delta$ is chosen as a small positive number
on the order of $0.01\sigma_x^2$ (Hubing and Alexander 1991). For $\lambda < 1$, the effects of the initial
conditions are quickly "forgotten." An exact initialization method is discussed in Problem
10.31.

Although the fast direct-form RLS algorithms have the lowest computational complexity,
they suffer from numerical instability when $\lambda < 1$ (Ljung and Ljung 1985). When these
algorithms are implemented with finite precision, the exact algebraic relations used for their
derivation break down and lead to numerical problems.

There are two ways to deal with stabilization of the fast direct-form RLS algorithms.
In the first approach, we try to identify precursors of ill behavior (warnings) and then use
appropriate rescue operations to restore the normal operation of the algorithm (Lin 1984;
Cioffi and Kailath 1984). One widely used rescue variable is

$$\eta_m(n) \triangleq \frac{\bar{\alpha}_{m+1}(n)}{\bar{\alpha}_m(n)} = \frac{\lambda E_m^b(n-1)}{E_m^b(n)} \tag{10.7.32}$$

which satisfies $0 \le \eta_m(n) \le 1$ for infinite-precision arithmetic (see Problem 10.33 for
more details).

In the second approach, we exploit the fact that certain algorithmic quantities can be
computed in two different ways. Therefore, we could use their difference, which provides
a measure of the numerical errors, to change the dynamics of the error propagation system
and stabilize the algorithm. For example, both $e_m^b(n)$ and $\alpha_m(n)$ can be computed either
using their definition or simpler order-recursions. This approach has been used to obtain
stabilized algorithms with complexities 9M and 8M; however, their performance is highly
dependent on proper initialization (Slock and Kailath 1991, 1993).


10.7.2 RLS Lattice-Ladder Filters 


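The lattice-ladder structure† derived in Section 7.3 using the MSE criterion, due to the
similarity of (10.7.2) and (10.7.3), holds for the prewindowed LSE criterion as well. This
structure, which is depicted in Figure 10.38 for the a posteriori error case, is described by
the following equations

$$\varepsilon_0^f(n) = \varepsilon_0^b(n) = x(n)$$

$$\varepsilon_{m+1}^f(n) = \varepsilon_m^f(n) + k_m^{f*}(n)\, \varepsilon_m^b(n-1) \qquad 0 \le m < M-1 \tag{10.7.33}$$

$$\varepsilon_{m+1}^b(n) = \varepsilon_m^b(n-1) + k_m^{b*}(n)\, \varepsilon_m^f(n) \qquad 0 \le m < M-1 \tag{10.7.34}$$

for the lattice part and

$$\varepsilon_0(n) = y(n), \qquad \varepsilon_{m+1}(n) = \varepsilon_m(n) - k_m^{c*}(n)\, \varepsilon_m^b(n) \qquad 0 \le m \le M-1 \tag{10.7.35}$$

for the ladder part. The lattice parameters are given by

$$k_m^f(n) = -\frac{\beta_m(n)}{E_m^b(n-1)} \tag{10.7.36}$$

and

$$k_m^b(n) = -\frac{\beta_m^*(n)}{E_m^f(n)} \tag{10.7.37}$$

and the ladder parameters by

$$k_m^c(n) = \frac{\beta_m^c(n)}{E_m^b(n)} \tag{10.7.38}$$

where

$$\beta_m(n) = \mathbf{b}_m^H(n-1)\, \mathbf{r}_m^f(n) + r_{m+1}^f(n) \tag{10.7.39}$$

and

$$\beta_m^c(n) = \mathbf{b}_m^H(n)\, \mathbf{d}_m(n) + d_{m+1}(n) \tag{10.7.40}$$

are the partial correlation parameters.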

However, as we recall, the time updating of the minimum LSE energies and the partial
correlations is possible only if there is a time update for the correlation matrix $\mathbf{R}_m(n)$ and
the cross-correlation vector $\mathbf{d}_m(n)$.

The minimum LSE energies can be updated in time using

$$E_m^f(n) = \lambda E_m^f(n-1) + \varepsilon_m^f(n)\, e_m^{f*}(n) \tag{10.7.41}$$

$$E_m^b(n) = \lambda E_m^b(n-1) + \varepsilon_m^b(n)\, e_m^{b*}(n) \tag{10.7.42}$$

or their variations, given in Table 10.12.
To update the partial correlation $\beta_m(n)$, we start with the definition (10.7.39) and then
use the time-updating formulas for all involved quantities, rearranging and recombining

†In Chapter 7 we used the symbol $e_m(n)$ because we had no need to distinguish between a priori and a posteriori
errors. However, since the error $e_m(n)$ in Section 7.3 is an a posteriori error, we now use the symbol $\varepsilon_m(n)$.
FIGURE 10.38
A posteriori error RLS lattice-ladder filter.


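terms as follows:

$$\begin{aligned}
\beta_m(n+1) &= \mathbf{b}_m^H(n)\, \mathbf{r}_m^f(n+1) + r_{m+1}^f(n+1) \\
&= \mathbf{b}_m^H(n)\left[\lambda \mathbf{r}_m^f(n) + \mathbf{x}_m(n)\, x^*(n+1)\right] + \left[\lambda r_{m+1}^f(n) + x(n-m)\, x^*(n+1)\right] \\
&= \lambda \mathbf{b}_m^H(n)\, \mathbf{r}_m^f(n) + \varepsilon_m^b(n)\, x^*(n+1) + \lambda r_{m+1}^f(n) \\
&= \lambda\left[\mathbf{b}_m^H(n-1) - e_m^b(n)\, \mathbf{g}_m^H(n)\right]\mathbf{r}_m^f(n) + \lambda r_{m+1}^f(n) + \varepsilon_m^b(n)\, x^*(n+1) \\
&= \lambda \beta_m(n) + \varepsilon_m^b(n)\left[x^*(n+1) - \lambda \bar{\mathbf{g}}_m^H(n)\, \mathbf{r}_m^f(n)\right] \\
&= \lambda \beta_m(n) + \varepsilon_m^b(n)\left[x^*(n+1) - \mathbf{x}_m^H(n)\, \mathbf{R}_m^{-1}(n-1)\, \mathbf{r}_m^f(n)\right] \\
&= \lambda \beta_m(n) + \varepsilon_m^b(n)\left[x^*(n+1) + \mathbf{x}_m^H(n)\, \mathbf{a}_m(n)\right] \\
&= \lambda \beta_m(n) + \varepsilon_m^b(n)\, e_m^{f*}(n+1)
\end{aligned}$$

which provides the desired update formula. The updating

$$\beta_m(n) = \lambda \beta_m(n-1) + \varepsilon_m^b(n-1)\, e_m^{f*}(n) \tag{10.7.43}$$

$$\phantom{\beta_m(n)} = \lambda \beta_m(n-1) + \frac{1}{\bar{\alpha}_m(n-1)}\, \varepsilon_m^b(n-1)\, \varepsilon_m^{f*}(n) \tag{10.7.44}$$

is feasible because the right-hand side involves already-known quantities.
In a similar way (see Problem 10.36), we can show that

$$\beta_m^c(n) = \lambda \beta_m^c(n-1) + \varepsilon_m^b(n)\, e_m^*(n) \tag{10.7.45}$$

$$\phantom{\beta_m^c(n)} = \lambda \beta_m^c(n-1) + \frac{1}{\bar{\alpha}_m(n)}\, \varepsilon_m^b(n)\, \varepsilon_m^*(n) \tag{10.7.46}$$

which facilitates the updating of the ladder parameters.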

To obtain an a posteriori algorithm, we need the conversion factor $\bar{\alpha}_m(n)$, which can
be obtained using the order-recursive formula (10.7.26). A detailed organization of the
a posteriori LS lattice-ladder algorithm, which requires about 20M operations per time
update, is given in Table 10.15. The initialization of the algorithm is easily obtained from
the definitions of the corresponding quantities. The condition $\bar{\alpha}_0(n-1) = 1$ follows from
(10.7.25), and the positive constant $\delta$ is chosen to ensure the invertibility of the LS correlation
matrix $\mathbf{R}(n)$ (see Section 10.5). The time-updating recursions (c) and (d) can be replaced
by order recursions, as explained in Problem 10.37.

TABLE 10.15
Computational organization of a posteriori
LS lattice-ladder algorithm.

Equation   Computation

Time initialization (n = 0)

$E_m^f(-1) = E_m^b(-1) = \delta > 0 \qquad 0 \le m \le M-1$
$\beta_m(-1) = 0, \quad \varepsilon_m^b(-1) = 0 \qquad 0 \le m \le M-1$
$\beta_m^c(-1) = 0 \qquad 0 \le m \le M-1$

Order initialization

(a)  $\varepsilon_0^f(n) = \varepsilon_0^b(n) = x(n), \quad \varepsilon_0(n) = y(n), \quad \bar{\alpha}_0(n-1) = 1$

Lattice part: m = 0, 1, ..., M − 1

(b)  $\beta_m(n) = \lambda \beta_m(n-1) + \dfrac{\varepsilon_m^b(n-1)\, \varepsilon_m^{f*}(n)}{\bar{\alpha}_m(n-1)}$
(c)  $E_m^f(n) = \lambda E_m^f(n-1) + \dfrac{|\varepsilon_m^f(n)|^2}{\bar{\alpha}_m(n-1)}$
(d)  $E_m^b(n) = \lambda E_m^b(n-1) + \dfrac{|\varepsilon_m^b(n)|^2}{\bar{\alpha}_m(n)}$
(e)  $k_m^f(n) = \dfrac{-\beta_m(n)}{E_m^b(n-1)}$
(f)  $k_m^b(n) = \dfrac{-\beta_m^*(n)}{E_m^f(n)}$
(g)  $\varepsilon_{m+1}^f(n) = \varepsilon_m^f(n) + k_m^{f*}(n)\, \varepsilon_m^b(n-1)$
(h)  $\varepsilon_{m+1}^b(n) = \varepsilon_m^b(n-1) + k_m^{b*}(n)\, \varepsilon_m^f(n)$
(i)  $\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n) - \dfrac{|\varepsilon_m^b(n)|^2}{E_m^b(n)}$

Ladder part: m = 1, 2, ..., M

(j)  $\beta_m^c(n) = \lambda \beta_m^c(n-1) + \dfrac{\varepsilon_m^b(n)\, \varepsilon_m^*(n)}{\bar{\alpha}_m(n)}$
(k)  $k_m^c(n) = \dfrac{\beta_m^c(n)}{E_m^b(n)}$
(l)  $\varepsilon_{m+1}(n) = \varepsilon_m(n) - k_m^{c*}(n)\, \varepsilon_m^b(n)$


If instead of the a posteriori errors we use the a priori ones, we obtain the following
recursions

$$e_0^f(n) = e_0^b(n) = x(n)$$

$$e_{m+1}^f(n) = e_m^f(n) + k_m^{f*}(n-1)\, e_m^b(n-1) \qquad 0 \le m < M-1 \tag{10.7.47}$$

$$e_{m+1}^b(n) = e_m^b(n-1) + k_m^{b*}(n-1)\, e_m^f(n) \qquad 0 \le m < M-1 \tag{10.7.48}$$

for the lattice part and

$$e_0(n) = y(n), \qquad e_{m+1}(n) = e_m(n) - k_m^{c*}(n-1)\, e_m^b(n) \qquad 0 \le m \le M-1 \tag{10.7.49}$$

for the ladder part (see Problem 10.38). As expected, the a priori structure uses the old LS
estimates of the lattice-ladder parameters. Based on these recursions, we can develop the a
priori error RLS lattice-ladder algorithm shown in Table 10.16, which requires about 20M
operations per time update.


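TABLE 10.16
Computational organization of a priori LS lattice-ladder
algorithm.

Equation   Computation

Time initialization

$E_m^f(-1) = E_m^b(-1) = \delta > 0$
$\beta_m(-1) = 0, \quad e_m^b(-1) = 0 \qquad 0 \le m \le M-1$
$\beta_m^c(-1) = 0 \qquad 0 \le m \le M-1$

Order initialization

(a)  $e_0^f(n) = e_0^b(n) = x(n), \quad e_0(n) = y(n), \quad \bar{\alpha}_0(n-1) = 1$

Lattice part: m = 0, 1, ..., M − 2

(b)  $e_{m+1}^f(n) = e_m^f(n) + k_m^{f*}(n-1)\, e_m^b(n-1)$
(c)  $e_{m+1}^b(n) = e_m^b(n-1) + k_m^{b*}(n-1)\, e_m^f(n)$
(d)  $\beta_m(n) = \lambda \beta_m(n-1) + \bar{\alpha}_m(n-1)\, e_m^b(n-1)\, e_m^{f*}(n)$
(e)  $E_m^f(n) = \lambda E_m^f(n-1) + \bar{\alpha}_m(n-1)\, |e_m^f(n)|^2$
(f)  $E_m^b(n) = \lambda E_m^b(n-1) + \bar{\alpha}_m(n)\, |e_m^b(n)|^2$
(g)  $k_m^f(n) = \dfrac{-\beta_m(n)}{E_m^b(n-1)}$
(h)  $k_m^b(n) = \dfrac{-\beta_m^*(n)}{E_m^f(n)}$
(i)  $\bar{\alpha}_m(n) = \bar{\alpha}_{m-1}(n) - \dfrac{\bar{\alpha}_{m-1}^2(n)\, |e_{m-1}^b(n)|^2}{E_{m-1}^b(n)}$

Ladder part: m = 1, 2, ..., M

(j)  $\beta_m^c(n) = \lambda \beta_m^c(n-1) + \bar{\alpha}_m(n)\, e_m^b(n)\, e_m^*(n)$
(k)  $k_m^c(n) = \dfrac{\beta_m^c(n)}{E_m^b(n)}$
(l)  $e_{m+1}(n) = e_m(n) - k_m^{c*}(n-1)\, e_m^b(n)$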


10.7.3 RLS Lattice-Ladder Filters Using Error Feedback Updatings 


The LS lattice-ladder algorithms introduced in the previous section update the partial
correlations $\beta_m(n)$ and $\beta_m^c(n)$ and the minimum error energies $E_m^f(n)$ and $E_m^b(n)$, and then
compute the coefficients of the LS lattice-ladder filter by division. We next develop two
algebraically equivalent algorithms, that is, algorithms that solve the same LS problem,
which update the lattice-ladder coefficients directly. These algorithms, introduced in Ling
et al. (1986), have good numerical properties when implemented with finite-word-length
arithmetic.


Starting with (10.7.38) and (10.7.45) we have

$$\begin{aligned}
k_m^c(n) = \frac{\beta_m^c(n)}{E_m^b(n)} &= \lambda\, \frac{\beta_m^c(n-1)}{E_m^b(n-1)}\, \frac{E_m^b(n-1)}{E_m^b(n)} + \frac{\bar{\alpha}_m(n)\, e_m^b(n)\, e_m^*(n)}{E_m^b(n)} \\
&= \frac{1}{E_m^b(n)} \left[ k_m^c(n-1)\, \lambda E_m^b(n-1) + \bar{\alpha}_m(n)\, e_m^b(n)\, e_m^*(n) \right]
\end{aligned} \tag{10.7.50}$$

or using $\lambda E_m^b(n-1) = E_m^b(n) - \bar{\alpha}_m(n)\, e_m^b(n)\, e_m^{b*}(n)$
we obtain

$$k_m^c(n) = k_m^c(n-1) + \frac{\bar{\alpha}_m(n)\, e_m^b(n)}{E_m^b(n)} \left[ e_m^*(n) - k_m^c(n-1)\, e_m^{b*}(n) \right]$$

or

$$k_m^c(n) = k_m^c(n-1) + \frac{\bar{\alpha}_m(n)\, e_m^b(n)\, e_{m+1}^*(n)}{E_m^b(n)} \tag{10.7.51}$$

using (10.7.49). Equation (10.7.51) provides a direct updating of the ladder parameters.
Similar direct updating formulas can be obtained for the lattice coefficients (see Problem
10.39). Using these updatings, we obtain the a priori RLS lattice-ladder algorithm with
error feedback shown in Table 10.17.
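As a quick numerical check of this derivation, the sketch below compares, for a single stage and real-valued data, the indirect ladder update (update $\beta_m^c$ and $E_m^b$, then divide) with the direct error feedback update (10.7.51). The function names and the numbers used are illustrative assumptions only.

```python
def ladder_indirect(beta_prev, Eb_prev, alpha, eb, e, lam):
    """Update beta^c and E^b, then form k^c = beta^c / E^b (real-valued case)."""
    Eb = lam * Eb_prev + alpha * eb * eb
    beta = lam * beta_prev + alpha * eb * e
    return beta / Eb

def ladder_direct(k_prev, Eb_prev, alpha, eb, e, lam):
    """Error feedback form: update k^c directly from the higher-order error."""
    Eb = lam * Eb_prev + alpha * eb * eb
    e_next = e - k_prev * eb              # a priori error of order m+1
    return k_prev + alpha * eb * e_next / Eb

# Illustrative values: the old coefficient is k^c(n-1) = beta^c(n-1) / E^b(n-1).
beta_prev, Eb_prev = 0.3, 2.0
k_prev = beta_prev / Eb_prev
k_ind = ladder_indirect(beta_prev, Eb_prev, alpha=0.8, eb=0.5, e=-0.7, lam=0.98)
k_dir = ladder_direct(k_prev, Eb_prev, alpha=0.8, eb=0.5, e=-0.7, lam=0.98)
```

In exact arithmetic the two results coincide; the forms differ only in how rounding errors propagate.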


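TABLE 10.17
Computational organization of a priori RLS lattice-ladder
algorithm with direct updating of its coefficients using
error feedback formula.

Equation   Computation

Time initialization

$E_m^f(-1) = E_m^b(-1) = \delta > 0$
$k_m^f(-1) = k_m^b(-1) = 0$
$e_m^b(-1) = 0, \quad k_m^c(-1) = 0$

Order initialization

(a)  $e_0^f(n) = e_0^b(n) = x(n), \quad e_0(n) = y(n), \quad \bar{\alpha}_0(n) = 1$

Lattice part: m = 0, 1, ..., M − 2

(b)  $e_{m+1}^f(n) = e_m^f(n) + k_m^{f*}(n-1)\, e_m^b(n-1)$
(c)  $e_{m+1}^b(n) = e_m^b(n-1) + k_m^{b*}(n-1)\, e_m^f(n)$
(d)  $E_m^f(n) = \lambda E_m^f(n-1) + \bar{\alpha}_m(n-1)\, |e_m^f(n)|^2$
(e)  $E_m^b(n) = \lambda E_m^b(n-1) + \bar{\alpha}_m(n)\, |e_m^b(n)|^2$
(f)  $k_m^f(n) = k_m^f(n-1) - \dfrac{\bar{\alpha}_m(n-1)\, e_m^b(n-1)\, e_{m+1}^{f*}(n)}{E_m^b(n-1)}$
(g)  $k_m^b(n) = k_m^b(n-1) - \dfrac{\bar{\alpha}_m(n-1)\, e_m^f(n)\, e_{m+1}^{b*}(n)}{E_m^f(n)}$
(h)  $\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n) - \dfrac{\bar{\alpha}_m^2(n)\, |e_m^b(n)|^2}{E_m^b(n)}$

Ladder part: m = 0, 1, ..., M − 1

(i)  $e_{m+1}(n) = e_m(n) - k_m^{c*}(n-1)\, e_m^b(n)$
(j)  $k_m^c(n) = k_m^c(n-1) + \dfrac{\bar{\alpha}_m(n)\, e_m^b(n)\, e_{m+1}^*(n)}{E_m^b(n)}$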


We note that we first use the coefficient $k_m^c(n-1)$ to compute the higher-order error
$e_{m+1}(n)$ by (10.7.49) and then use that error to update the coefficient using (10.7.51). This
updating has a feedback-like structure that is sometimes referred to as error feedback form.
An a posteriori form of the RLS lattice-ladder algorithm with error feedback can be easily
obtained as shown in Problem 10.40. Simulation studies (Ling et al. 1986) have shown that
when we use finite-precision arithmetic, the algorithms with direct updating of the lattice
coefficients have better numerical properties than the algorithms with indirect updating.
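The recursions of Table 10.17 translate almost line by line into code. The following real-valued sketch is an illustration under stated assumptions rather than a reference implementation: the names `rls_lattice_ladder_ef` and `demo_data` are hypothetical, all stages share the same initial energy $\delta$, and for simplicity the lattice recursions are also run for the last stage, whose outputs are not used.

```python
import random

def rls_lattice_ladder_ef(x, y, M, lam=1.0, delta=0.01):
    """A priori RLS lattice-ladder with error feedback (real data).
    Returns the a priori filtering errors e_M(n)."""
    kf = [0.0] * M              # forward lattice coefficients
    kb = [0.0] * M              # backward lattice coefficients
    kc = [0.0] * M              # ladder coefficients
    Ef = [delta] * M            # forward energies E^f_m(n-1)
    Eb_prev = [delta] * M       # backward energies E^b_m(n-1)
    eb_prev = [0.0] * M         # backward errors e^b_m(n-1)
    alpha_prev = [1.0] * M      # conversion factors at time n-1
    errors = []
    for xn, yn in zip(x, y):
        ef = eb = xn            # order initialization
        e = yn
        alpha = 1.0
        for m in range(M):
            Eb = lam * Eb_prev[m] + alpha * eb * eb
            # ladder part: error first, then coefficient (error feedback)
            e_next = e - kc[m] * eb
            kc[m] += alpha * eb * e_next / Eb
            # lattice part, using the old coefficients
            ef_next = ef + kf[m] * eb_prev[m]
            eb_next = eb_prev[m] + kb[m] * ef
            Ef[m] = lam * Ef[m] + alpha_prev[m] * ef * ef
            kf[m] -= alpha_prev[m] * eb_prev[m] * ef_next / Eb_prev[m]
            kb[m] -= alpha_prev[m] * ef * eb_next / Ef[m]
            alpha_next = alpha - alpha * alpha * eb * eb / Eb
            # store stage-m quantities needed at time n+1
            eb_prev[m], alpha_prev[m], Eb_prev[m] = eb, alpha, Eb
            ef, eb, e, alpha = ef_next, eb_next, e_next, alpha_next
        errors.append(e)
    return errors

def demo_data(N=200, seed=1):
    """Hypothetical noiseless identification problem with a 3-tap system."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(N)]
    h = [0.5, -0.3, 0.2]
    y = [sum(h[k] * x[n - k] for k in range(3) if n - k >= 0) for n in range(N)]
    return x, y
```

With noiseless data generated by a three-tap system and M = 3, the a priori error should decay toward zero as the influence of the initial energies fades.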


10.7.4 Givens Rotation—Based LS Lattice-Ladder Algorithms 


We next show how to implement the LS lattice-ladder computations by using the Givens
rotation (see Section 8.6) with and without square roots. The resulting algorithms exploit
the shift invariance of the input data to reduce the computational complexity from $O(M^2)$
to $O(M)$ operations (Ling 1991; Proudler et al. 1989).

We start by introducing the angle normalized errors

$$\tilde{e}_m(n) \triangleq \sqrt{e_m(n)\, \varepsilon_m(n)} = e_m(n) \sqrt{\bar{\alpha}_m(n)} \tag{10.7.52}$$

$$\tilde{e}_m^f(n) \triangleq \sqrt{e_m^f(n)\, \varepsilon_m^f(n)} = e_m^f(n) \sqrt{\bar{\alpha}_m(n-1)} \tag{10.7.53}$$

$$\tilde{e}_m^b(n) \triangleq \sqrt{e_m^b(n)\, \varepsilon_m^b(n)} = e_m^b(n) \sqrt{\bar{\alpha}_m(n)} \tag{10.7.54}$$

which are basically the geometric mean of the corresponding a priori and a posteriori errors
[see the discussion following (10.5.24) for the interpretation of $\bar{\alpha}_m(n)$ as an angle variable].
If we formulate the LS problem in terms of these errors, we do not need to distinguish
between a priori and a posteriori error algorithms.

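Using the a priori lattice equation (10.7.47) for the forward predictor and the definitions
of the angle normalized errors, we obtain

$$\tilde{e}_{m+1}^f(n) = \sqrt{\frac{\bar{\alpha}_{m+1}(n-1)}{\bar{\alpha}_m(n-1)}}\; \tilde{e}_m^f(n) - \frac{\beta_m^*(n-1)}{E_m^b(n-2)} \sqrt{\frac{\bar{\alpha}_{m+1}(n-1)}{\bar{\alpha}_m(n-1)}}\; \tilde{e}_m^b(n-1)$$

or by using (10.7.30)

$$\tilde{e}_{m+1}^f(n) = \sqrt{\frac{\lambda E_m^b(n-2)}{E_m^b(n-1)}}\; \tilde{e}_m^f(n) - \sqrt{\lambda}\, \frac{\beta_m^*(n-1)}{\sqrt{E_m^b(n-2)}}\, \frac{\tilde{e}_m^b(n-1)}{\sqrt{E_m^b(n-1)}} \tag{10.7.55}$$

If we define the quantities

$$\tilde{c}_m^b(n) \triangleq \sqrt{\frac{\lambda E_m^b(n-1)}{E_m^b(n)}} \tag{10.7.56}$$

$$\tilde{s}_m^b(n) \triangleq \frac{\tilde{e}_m^b(n)}{\sqrt{E_m^b(n)}} \tag{10.7.57}$$

and

$$\tilde{k}_m^f(n) \triangleq -\frac{\beta_m(n)}{\sqrt{E_m^b(n-1)}} = k_m^f(n) \sqrt{E_m^b(n-1)} \tag{10.7.58}$$

we obtain

$$\tilde{e}_{m+1}^f(n) = \tilde{c}_m^b(n-1)\, \tilde{e}_m^f(n) + \sqrt{\lambda}\, \tilde{s}_m^b(n-1)\, \tilde{k}_m^{f*}(n-1) \tag{10.7.59}$$

which provides the order update of the angle normalized forward prediction error.
To obtain the update equation for the normalized coefficient $\tilde{k}_m^f(n)$, we start with

$$\beta_m(n) = \lambda \beta_m(n-1) + \bar{\alpha}_m(n-1)\, e_m^b(n-1)\, e_m^{f*}(n) \tag{10.7.60}$$

and using (10.7.58), (10.7.53), and (10.7.54), we obtain

$$-\tilde{k}_m^f(n) \sqrt{E_m^b(n-1)} = -\lambda\, \tilde{k}_m^f(n-1) \sqrt{E_m^b(n-2)} + \tilde{e}_m^b(n-1)\, \tilde{e}_m^{f*}(n)$$

or finally

$$\tilde{k}_m^{f*}(n) = \sqrt{\lambda}\, \tilde{c}_m^b(n-1)\, \tilde{k}_m^{f*}(n-1) - \tilde{s}_m^{b*}(n-1)\, \tilde{e}_m^f(n) \tag{10.7.61}$$

with the help of (10.7.56) and (10.7.57).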


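Using the a priori lattice equation (10.7.48) for the backward predictor and the
definitions of the angle normalized errors, we obtain

$$\tilde{e}_{m+1}^b(n) = \sqrt{\frac{\bar{\alpha}_{m+1}(n)}{\bar{\alpha}_m(n-1)}}\; \tilde{e}_m^b(n-1) - \frac{\beta_m(n-1)}{E_m^f(n-1)} \sqrt{\frac{\bar{\alpha}_{m+1}(n)}{\bar{\alpha}_m(n-1)}}\; \tilde{e}_m^f(n)$$

or

$$\tilde{e}_{m+1}^b(n) = \sqrt{\frac{\lambda E_m^f(n-1)}{E_m^f(n)}}\; \tilde{e}_m^b(n-1) - \sqrt{\lambda}\, \frac{\beta_m(n-1)}{\sqrt{E_m^f(n-1)}}\, \frac{\tilde{e}_m^f(n)}{\sqrt{E_m^f(n)}} \tag{10.7.62}$$

by using (10.7.29). If we define the quantities

$$\tilde{c}_m^f(n) \triangleq \sqrt{\frac{\lambda E_m^f(n-1)}{E_m^f(n)}} \tag{10.7.63}$$

$$\tilde{s}_m^f(n) \triangleq \frac{\tilde{e}_m^f(n)}{\sqrt{E_m^f(n)}} \tag{10.7.64}$$

and

$$\tilde{k}_m^b(n) \triangleq -\frac{\beta_m^*(n)}{\sqrt{E_m^f(n)}} = k_m^b(n) \sqrt{E_m^f(n)} \tag{10.7.65}$$

we obtain

$$\tilde{e}_{m+1}^b(n) = \tilde{c}_m^f(n)\, \tilde{e}_m^b(n-1) + \sqrt{\lambda}\, \tilde{s}_m^f(n)\, \tilde{k}_m^{b*}(n-1) \tag{10.7.66}$$

which provides the update equation for the angle normalized backward prediction error.
The updating of $\tilde{k}_m^b(n)$ is given by

$$\tilde{k}_m^{b*}(n) = \sqrt{\lambda}\, \tilde{c}_m^f(n)\, \tilde{k}_m^{b*}(n-1) - \tilde{s}_m^{f*}(n)\, \tilde{e}_m^b(n-1) \tag{10.7.67}$$

and can be easily obtained, like (10.7.61), by combining (10.7.60) with (10.7.63) through
(10.7.65).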

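Similar updatings can be easily derived for the ladder part of the filter. Indeed, using
(10.7.49), the definitions of the angle normalized errors, and (10.7.30), we have

$$\tilde{e}_{m+1}(n) = \sqrt{\frac{\lambda E_m^b(n-1)}{E_m^b(n)}}\; \tilde{e}_m(n) - \sqrt{\lambda}\, \frac{\beta_m^{c*}(n-1)}{\sqrt{E_m^b(n-1)}}\, \frac{\tilde{e}_m^b(n)}{\sqrt{E_m^b(n)}}$$

or

$$\tilde{e}_{m+1}(n) = \tilde{c}_m^b(n)\, \tilde{e}_m(n) - \sqrt{\lambda}\, \tilde{s}_m^b(n)\, \tilde{k}_m^{c*}(n-1) \tag{10.7.68}$$

where

$$\tilde{k}_m^c(n) \triangleq \frac{\beta_m^c(n)}{\sqrt{E_m^b(n)}} = k_m^c(n) \sqrt{E_m^b(n)} \tag{10.7.69}$$

is a normalized ladder coefficient. This coefficient can be updated by using the recursion

$$\tilde{k}_m^{c*}(n) = \sqrt{\lambda}\, \tilde{c}_m^b(n)\, \tilde{k}_m^{c*}(n-1) + \tilde{s}_m^{b*}(n)\, \tilde{e}_m(n) \tag{10.7.70}$$

which can be obtained, like (10.7.61) and (10.7.67), by using (10.7.45) and related
definitions.
If we define the normalized energies

$$\tilde{E}_m^f(n) \triangleq \sqrt{E_m^f(n)} \tag{10.7.71}$$

and

$$\tilde{E}_m^b(n) \triangleq \sqrt{E_m^b(n)} \tag{10.7.72}$$

we can easily show, using (10.7.41) and (10.7.42), that

$$\tilde{E}_m^f(n) = \sqrt{\lambda}\, \tilde{c}_m^f(n)\, \tilde{E}_m^f(n-1) + \tilde{s}_m^f(n)\, \tilde{e}_m^{f*}(n) \tag{10.7.73}$$

and

$$\tilde{E}_m^b(n) = \sqrt{\lambda}\, \tilde{c}_m^b(n)\, \tilde{E}_m^b(n-1) + \tilde{s}_m^b(n)\, \tilde{e}_m^{b*}(n) \tag{10.7.74}$$

which provide time updates for the normalized minimum energies. However, the following
recursions

$$\tilde{E}_m^f(n) = \left\{ \lambda [\tilde{E}_m^f(n-1)]^2 + |\tilde{e}_m^f(n)|^2 \right\}^{1/2} \tag{10.7.75}$$

$$\tilde{E}_m^b(n) = \left\{ \lambda [\tilde{E}_m^b(n-1)]^2 + |\tilde{e}_m^b(n)|^2 \right\}^{1/2} \tag{10.7.76}$$

obtained from (10.7.41) and (10.7.42), provide more convenient updatings.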
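We now have a complete formulation of the LS lattice-ladder recursions using angle
normalized errors. To see the meaning and significance of these recursions, we express them
in matrix form as

$$\begin{bmatrix} \tilde{e}_{m+1}^f(n) \\ \tilde{k}_m^{f*}(n) \end{bmatrix} = \begin{bmatrix} \tilde{c}_m^b(n-1) & \tilde{s}_m^b(n-1) \\ -\tilde{s}_m^{b*}(n-1) & \tilde{c}_m^b(n-1) \end{bmatrix} \begin{bmatrix} \tilde{e}_m^f(n) \\ \sqrt{\lambda}\, \tilde{k}_m^{f*}(n-1) \end{bmatrix} \tag{10.7.77}$$

$$\begin{bmatrix} \tilde{e}_{m+1}^b(n) \\ \tilde{k}_m^{b*}(n) \end{bmatrix} = \begin{bmatrix} \tilde{c}_m^f(n) & \tilde{s}_m^f(n) \\ -\tilde{s}_m^{f*}(n) & \tilde{c}_m^f(n) \end{bmatrix} \begin{bmatrix} \tilde{e}_m^b(n-1) \\ \sqrt{\lambda}\, \tilde{k}_m^{b*}(n-1) \end{bmatrix} \tag{10.7.78}$$

$$\begin{bmatrix} \tilde{e}_{m+1}(n) \\ \tilde{k}_m^{c*}(n) \end{bmatrix} = \begin{bmatrix} \tilde{c}_m^b(n) & -\tilde{s}_m^b(n) \\ \tilde{s}_m^{b*}(n) & \tilde{c}_m^b(n) \end{bmatrix} \begin{bmatrix} \tilde{e}_m(n) \\ \sqrt{\lambda}\, \tilde{k}_m^{c*}(n-1) \end{bmatrix} \tag{10.7.79}$$

where we see that the updating of the forward predictor parameters and the ladder
parameters involves the same matrix delayed by one sample. The different position of the minus
sign, due to the different sign used in the definitions of $\tilde{k}_m^f(n)$ and $\tilde{k}_m^c(n)$, is immaterial.
Furthermore, it is straightforward to show that

$$|\tilde{c}_m^f(n)|^2 + |\tilde{s}_m^f(n)|^2 = 1 \tag{10.7.80}$$

and

$$|\tilde{c}_m^b(n)|^2 + |\tilde{s}_m^b(n)|^2 = 1 \tag{10.7.81}$$

which imply that the matrices in (10.7.77) through (10.7.79) are the Givens rotation matrices.
Therefore, we have obtained a formulation of the LS lattice-ladder algorithm that updates
the angle normalized errors and a set of normalized lattice-ladder coefficients using the
Givens rotations. Using (10.7.76) and definitions of $\tilde{c}_m^b(n)$ and $\tilde{s}_m^b(n)$, we can show that

$$\begin{bmatrix} \tilde{E}_m^b(n) \\ 0 \end{bmatrix} = \begin{bmatrix} \tilde{c}_m^b(n) & \tilde{s}_m^{b*}(n) \\ -\tilde{s}_m^b(n) & \tilde{c}_m^b(n) \end{bmatrix} \begin{bmatrix} \sqrt{\lambda}\, \tilde{E}_m^b(n-1) \\ \tilde{e}_m^b(n) \end{bmatrix}$$

which shows that we can use the BLP Givens rotation to update the normalized energy
$\tilde{E}_m^b(n)$. A similar transformation can be obtained for $\tilde{E}_m^f(n)$. However, the energy updatings
are usually performed using (10.7.75) and (10.7.76).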
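The rotation property can be verified numerically. The sketch below, for real-valued data, computes the rotation parameters from the previous normalized energy and the current angle normalized error, and applies the rotation to the pair $[\sqrt{\lambda}\tilde{E}(n-1),\ \tilde{e}(n)]$. The function names are hypothetical and the numbers are arbitrary.

```python
import math

def givens_energy_update(E_prev, e, lam):
    """Return the new normalized energy and the rotation parameters (c, s)."""
    E = math.sqrt(lam * E_prev * E_prev + e * e)   # energy recursion
    c = math.sqrt(lam) * E_prev / E                # cosine-like parameter
    s = e / E                                      # sine-like parameter
    return E, c, s

def rotate(c, s, u, v):
    """Apply the 2x2 rotation [[c, s], [-s, c]] to the pair (u, v)."""
    return c * u + s * v, -s * u + c * v

E, c, s = givens_energy_update(E_prev=2.0, e=1.5, lam=0.99)
top, bottom = rotate(c, s, math.sqrt(0.99) * 2.0, 1.5)
```

The rotation annihilates the second component and leaves the updated normalized energy in the first, which is the sense in which the energy update is a Givens rotation.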

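The square root-free version of the Givens LS lattice-ladder filter is basically a simple
modification of the error feedback form of the a priori LS lattice-ladder algorithm. Indeed,
using (10.7.50), we have

$$k_m^c(n) = \frac{\lambda E_m^b(n-1)}{E_m^b(n)}\, k_m^c(n-1) + \frac{\bar{\alpha}_m(n)\, e_m^b(n)}{E_m^b(n)}\, e_m^*(n) \tag{10.7.82}$$

or if we define the quantities

$$c_m^b(n) \triangleq \frac{\lambda E_m^b(n-1)}{E_m^b(n)} = |\tilde{c}_m^b(n)|^2 \tag{10.7.83}$$

and

$$s_m^b(n) \triangleq \frac{\bar{\alpha}_m(n)\, e_m^b(n)}{E_m^b(n)} \tag{10.7.84}$$

we obtain

$$k_m^c(n) = c_m^b(n)\, k_m^c(n-1) + s_m^b(n)\, e_m^*(n) \tag{10.7.85}$$

which provides the required updating for the ladder parameters.
Similarly, using the error feedback a priori updatings for the lattice parameters, we
obtain the recursions

$$k_m^f(n) = c_m^b(n-1)\, k_m^f(n-1) - s_m^b(n-1)\, e_m^{f*}(n) \tag{10.7.86}$$

and

$$k_m^b(n) = c_m^f(n)\, k_m^b(n-1) - s_m^f(n)\, e_m^{b*}(n-1) \tag{10.7.87}$$

where

$$c_m^f(n) \triangleq \frac{\lambda E_m^f(n-1)}{E_m^f(n)} = |\tilde{c}_m^f(n)|^2 \tag{10.7.88}$$

and

$$s_m^f(n) \triangleq \frac{\bar{\alpha}_m(n-1)\, e_m^f(n)}{E_m^f(n)} \tag{10.7.89}$$

are the forward rotation parameters. These recursions constitute the basis for the square
root-free Givens LS lattice-ladder algorithm.

Table 10.18 provides the complete computational organizations of the Givens LS
lattice-ladder algorithms with and without square roots. The square root algorithm is
initialized as usual with $\tilde{E}_m^f(-1) = \tilde{E}_m^b(-1) = \delta > 0$, $\tilde{e}_0^f(n) = \tilde{e}_0^b(n) = x(n)$, $\tilde{e}_0(n) = y(n)$,
$\bar{\alpha}_0(n) = 1$, and all other variables set to zero. The square root-free algorithm is initialized
as the a priori algorithm with error feedback. Figure 10.39 shows a single stage of the LS
lattice-ladder filter based on Givens rotations with square roots.

TABLE 10.18
Summary of the Givens LS lattice-ladder adaptive filter algorithms.

Equation   Square root form / Square root-free form

Forward rotation parameters

(a)  Square root:       $\tilde{E}_m^f(n) = \{\lambda [\tilde{E}_m^f(n-1)]^2 + |\tilde{e}_m^f(n)|^2\}^{1/2}$
     Square root-free:  $E_m^f(n) = \lambda E_m^f(n-1) + \bar{\alpha}_m(n-1)\, |e_m^f(n)|^2$
(b)  Square root:       $\tilde{c}_m^f(n) = \sqrt{\lambda}\, \tilde{E}_m^f(n-1) / \tilde{E}_m^f(n)$
     Square root-free:  $c_m^f(n) = \lambda E_m^f(n-1) / E_m^f(n)$
(c)  Square root:       $\tilde{s}_m^f(n) = \tilde{e}_m^f(n) / \tilde{E}_m^f(n)$
     Square root-free:  $s_m^f(n) = \bar{\alpha}_m(n-1)\, e_m^f(n) / E_m^f(n)$

Backward rotation parameters

(d)  Square root:       $\tilde{E}_m^b(n) = \{\lambda [\tilde{E}_m^b(n-1)]^2 + |\tilde{e}_m^b(n)|^2\}^{1/2}$
     Square root-free:  $E_m^b(n) = \lambda E_m^b(n-1) + \bar{\alpha}_m(n)\, |e_m^b(n)|^2$
(e)  Square root:       $\tilde{c}_m^b(n) = \sqrt{\lambda}\, \tilde{E}_m^b(n-1) / \tilde{E}_m^b(n)$
     Square root-free:  $c_m^b(n) = \lambda E_m^b(n-1) / E_m^b(n)$
(f)  Square root:       $\tilde{s}_m^b(n) = \tilde{e}_m^b(n) / \tilde{E}_m^b(n)$
     Square root-free:  $s_m^b(n) = \bar{\alpha}_m(n)\, e_m^b(n) / E_m^b(n)$

Forward predictor rotator

(g)  Square root:       $\tilde{e}_{m+1}^f(n) = \tilde{c}_m^b(n-1)\, \tilde{e}_m^f(n) + \sqrt{\lambda}\, \tilde{s}_m^b(n-1)\, \tilde{k}_m^{f*}(n-1)$
     Square root-free:  $e_{m+1}^f(n) = e_m^f(n) + k_m^{f*}(n-1)\, e_m^b(n-1)$
(h)  Square root:       $\tilde{k}_m^{f*}(n) = \sqrt{\lambda}\, \tilde{c}_m^b(n-1)\, \tilde{k}_m^{f*}(n-1) - \tilde{s}_m^{b*}(n-1)\, \tilde{e}_m^f(n)$
     Square root-free:  $k_m^f(n) = c_m^b(n-1)\, k_m^f(n-1) - s_m^b(n-1)\, e_m^{f*}(n)$

Backward predictor rotator

(i)  Square root:       $\tilde{e}_{m+1}^b(n) = \tilde{c}_m^f(n)\, \tilde{e}_m^b(n-1) + \sqrt{\lambda}\, \tilde{s}_m^f(n)\, \tilde{k}_m^{b*}(n-1)$
     Square root-free:  $e_{m+1}^b(n) = e_m^b(n-1) + k_m^{b*}(n-1)\, e_m^f(n)$
(j)  Square root:       $\tilde{k}_m^{b*}(n) = \sqrt{\lambda}\, \tilde{c}_m^f(n)\, \tilde{k}_m^{b*}(n-1) - \tilde{s}_m^{f*}(n)\, \tilde{e}_m^b(n-1)$
     Square root-free:  $k_m^b(n) = c_m^f(n)\, k_m^b(n-1) - s_m^f(n)\, e_m^{b*}(n-1)$

Filter rotator

(k)  Square root:       $\tilde{e}_{m+1}(n) = \tilde{c}_m^b(n)\, \tilde{e}_m(n) - \sqrt{\lambda}\, \tilde{s}_m^b(n)\, \tilde{k}_m^{c*}(n-1)$
     Square root-free:  $e_{m+1}(n) = e_m(n) - k_m^{c*}(n-1)\, e_m^b(n)$
(l)  Square root:       $\tilde{k}_m^{c*}(n) = \sqrt{\lambda}\, \tilde{c}_m^b(n)\, \tilde{k}_m^{c*}(n-1) + \tilde{s}_m^{b*}(n)\, \tilde{e}_m(n)$
     Square root-free:  $k_m^c(n) = c_m^b(n)\, k_m^c(n-1) + s_m^b(n)\, e_m^*(n)$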


[Figure 10.39: single lattice-ladder stage with outputs $\tilde{e}_{m+1}^f(n)$, $\tilde{e}_{m+1}^b(n)$, and $\tilde{e}_{m+1}(n)$.]

FIGURE 10.39
Block diagram representation of the Givens RLS lattice-ladder stage. Circles denote
computing elements that calculate the rotation parameters and squares denote
computing elements that perform the rotations.


10.7.5 Classification of RLS Algorithms for FIR Filtering 


Every exact RLS algorithm discussed in this section consists of two parts: a part that 
computes the LS forward and backward predictors of the input signal and a part that uses 
information from the linear prediction part to compute the LS filter. In all cases, information 
flows from the prediction part to the filtering part, but not vice versa. Therefore, all critical 
numerical operations take place in the linear prediction section. 

For direct-form structures, the prediction problems facilitate the fast computation of 
the RLS gain vectors. 

In the case of lattice-ladder structures, the lattice part (which again solves the linear
prediction problem) decorrelates (or orthogonalizes in the LS sense) the input signal vector
and creates an orthogonal basis consisting of the backward prediction errors $\{e_m^b(n)\}_{m=0}^{M-1}$.
This orthogonal basis is used by the ladder part to form the LS filtering error. Essentially,
the LS lattice part facilitates the triangular UDL decomposition of the inverse correlation
matrix $\mathbf{R}^{-1}(n)$ or the Gram-Schmidt orthogonalization of the columns of data matrix $\mathbf{X}(n)$.
This property makes the RLS lattice-ladder algorithm order-recursive, like its minimum
MSE counterpart (see Section 7.3).

The QRD-RLS lattice-ladder algorithms also consist of a lattice part that solves the
linear prediction problem and a ladder part that uses information from the lattice to form the
LS filtering estimate. The LS lattice produces the triangularization of the inverse correlation
matrix $\mathbf{R}^{-1}(n)$ whereas the QRD LS lattice produces the upper triangular Cholesky factor
of $\mathbf{R}(n)$ by applying an orthogonal transformation to data matrix $\mathbf{X}(n)$.

The correspondence of these algorithms to their counterparts for RLS array processing,
discussed in Section 10.6, is summarized in Figure 10.40.

FIGURE 10.40
Classification of RLS algorithms for array processing and FIR filtering.


It is interesting to note that the RLS lattice-ladder algorithms with error feedback are 
identical in form to the square root—free Givens rotation—based QRD-RLS lattice-ladder 
algorithms. This similarity explains the excellent numerical properties of both structures. 

The RLS lattice-ladder algorithms (both $\mathrm{UDL}^H$-decomposition based and
QR-decomposition based) share the following highly desirable characteristics:

• Good numerical properties that originate from the square root decomposition (Cholesky
or QR) part of the algorithms.

• Good convergence properties, which are inherited from the exact LS minimization
performed by all algorithms.

• Modularity and regularity that make possible their VLSI and multiprocessing
implementation.


It has been shown (Ljung and Ljung 1985) that all RLS lattice-ladder algorithms are
numerically stable for $\lambda < 1$. However, they differ in terms of numerical accuracy. It turns
out that the lattice-ladder algorithms with error feedback (which are basically equivalent to
the square root-free QRD lattice ladder) and the QRD lattice-ladder algorithms have the
best numerical accuracy.


10.8 TRACKING PERFORMANCE OF ADAPTIVE ALGORITHMS 


Tracking of a time-varying system is an important problem in many areas of application. 
Consider, for example, a digital communications system in which the channel characteristics 
may change with time for various reasons. If we want to incorporate an echo canceler in 
such a system, then clearly the echo canceler must monitor the changing impulse response 
of the echo path so that it can generate an accurate replica of the echo. This will require the 
adaptive algorithm of an echo canceler to possess an acceptable tracking capability. Similar 
situations arise in adaptive equalization, adaptive prediction, adaptive noise canceling, and 
so on. In all these applications, adaptive filters are forced to operate in a nonstationary SOE. 
In this section, we examine the ability and performance of the LMS and RLS algorithms to 
track the ever-changing minimum point of the error surface. 

As discussed earlier, the tracking mode is a steady-state operation of the adaptive
algorithm, and it follows the acquisition mode, which is a transient phenomenon. Therefore,
the algorithm must acquire the system parameters before tracking can commence. This has
two implications. First, the rate of convergence is generally not related to the tracking
behavior, and as such, we analyze the tracking behavior when the number of iterations
(or steps) is relatively large. Second, the time variation of the parameter change should be
small enough compared to the rate of convergence that the algorithm can perform adequate
tracking; otherwise, it is constantly acquiring the parameters.


10.8.1 Approaches for Nonstationary SOE 


To effectively track a nonstationary SOE, adaptive algorithms should use only local statis- 
tics. There are three practical ways in which this can be achieved. 


Exponentially growing window 


In this approach, the current data are artificially emphasized by exponentially weighting
past data values, as shown in Figure 10.41(a). The error function that is minimized is given
by

$$E(n) = \sum_{j=0}^{n} \lambda^{n-j}\, |y(j) - \mathbf{c}^H \mathbf{x}(j)|^2 = \lambda E(n-1) + |y(n) - \mathbf{c}^H \mathbf{x}(n)|^2 \tag{10.8.1}$$

where $0 < \lambda < 1$. Clearly, this is the cost function we used in the development of the RLS
algorithm, given in Table 10.6, in which $\lambda$ is termed the forgetting factor. The effective
window length is given by

$$L_{\text{eff}} \triangleq \sum_{j=0}^{\infty} \lambda^j = \frac{1}{1-\lambda} \tag{10.8.2}$$

Hence for good tracking performance $\lambda$ should be in the range $0.9 < \lambda < 1$. Note that
$\lambda = 1$ results in a rectangularly growing window that uses global statistics and hence will
not be able to track parameter changes. Thus the RLS algorithm with exponential forgetting
is capable of using the local information needed to adapt in a nonstationary SOE.
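To get a feeling for the numbers, the effective window length can be tabulated for a few forgetting factors. This is a small illustrative sketch; the function names are hypothetical, and the truncated sum is included only to show that the closed form is the sum of the exponential weights.

```python
def effective_window_length(lam):
    """Effective memory 1 / (1 - lam) of an exponential window, 0 < lam < 1."""
    if not 0.0 < lam < 1.0:
        raise ValueError("forgetting factor must satisfy 0 < lam < 1")
    return 1.0 / (1.0 - lam)

def truncated_weight_sum(lam, terms):
    """Sum of the first `terms` exponential weights lam**j."""
    return sum(lam ** j for j in range(terms))

for lam in (0.9, 0.99, 0.999):
    print(lam, effective_window_length(lam))
```

For example, a forgetting factor of 0.99 corresponds to an effective memory of roughly 100 samples, whereas 0.9 retains only about 10.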


[Figure 10.41: window shapes at times n, n + 1, and n + 2 for (a) the exponentially growing window and (b) the fixed-length sliding window.]

FIGURE 10.41
Illustration of exponentially growing and fixed-length sliding windows.

Fixed-length sliding window


The basic feature of this approach is that the parameter estimates are based only on a
finite number of past data values, as shown in Figure 10.41(b). Let us consider a rectangular
window of fixed length L > M. Then the cost function that is minimized is given by

$$E(n, L) \triangleq \sum_{j=n-L+1}^{n} |y(j) - \mathbf{c}^H \mathbf{x}(j)|^2 \tag{10.8.3}$$

When a new data value at n + 1 is added to the sum in (10.8.3), the oldest data value is
discarded, that is, all old data values beyond n − L + 1 are discarded. Thus the active number of data
values is always a constant equal to L, which makes this a constant-memory adaptive
algorithm. By following the steps given for the RLS adaptive filter in Section 10.5, it is
possible to derive a recursive algorithm to determine the filter $\mathbf{c}(n)$ that minimizes the error
function in (10.8.3).

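Let $\mathbf{c}_{\{n-L\}}(n-1)$ denote the estimate of $\mathbf{c}(n-1)$ based on L data values between
n − L and n − 1. After the new data value at n is observed, the RLS algorithm in Table
10.6 is applicable with $\lambda = 1$ and with obvious extension of notation. Hence we obtain the
algorithm

$$\mathbf{c}_{\{n-L\}}(n) = \mathbf{c}_{\{n-L\}}(n-1) + \mathbf{g}_{\{n-L\}}(n)\, e^*(n) \tag{10.8.4}$$

$$e(n) = y(n) - \mathbf{c}_{\{n-L\}}^H(n-1)\, \mathbf{x}(n) \tag{10.8.5}$$

$$\mathbf{g}_{\{n-L\}}(n) = \frac{\bar{\mathbf{g}}_{\{n-L\}}(n)}{\alpha_{\{n-L\}}(n)} \tag{10.8.6}$$

$$\bar{\mathbf{g}}_{\{n-L\}}(n) = \mathbf{P}_{\{n-L\}}(n-1)\, \mathbf{x}(n) \tag{10.8.7}$$

$$\alpha_{\{n-L\}}(n) = 1 + \bar{\mathbf{g}}_{\{n-L\}}^H(n)\, \mathbf{x}(n) \tag{10.8.8}$$

$$\mathbf{P}_{\{n-L\}}(n) = \mathbf{P}_{\{n-L\}}(n-1) - \mathbf{g}_{\{n-L\}}(n)\, \bar{\mathbf{g}}_{\{n-L\}}^H(n) \tag{10.8.9}$$

The above algorithm is based on L + 1 data values. To maintain the data window at fixed
length L, we have to discard the observation at n − L. By using the matrix inversion lemma
given in Appendix A, it can be shown that (see Problem 10.51)

$$\mathbf{c}_{\{n-L+1\}}(n) = \mathbf{c}_{\{n-L\}}(n) - \mathbf{g}_{\{n-L+1\}}(n)\, e^*(n-L) \tag{10.8.10}$$

$$e(n-L) = y(n-L) - \mathbf{c}_{\{n-L\}}^H(n)\, \mathbf{x}(n-L) \tag{10.8.11}$$

$$\mathbf{g}_{\{n-L+1\}}(n) = \frac{\bar{\mathbf{g}}_{\{n-L+1\}}(n)}{\alpha_{\{n-L+1\}}(n)} \tag{10.8.12}$$

$$\bar{\mathbf{g}}_{\{n-L+1\}}(n) = \mathbf{P}_{\{n-L\}}(n)\, \mathbf{x}(n-L) \tag{10.8.13}$$

$$\alpha_{\{n-L+1\}}(n) = 1 - \bar{\mathbf{g}}_{\{n-L+1\}}^H(n)\, \mathbf{x}(n-L) \tag{10.8.14}$$

$$\mathbf{P}_{\{n-L+1\}}(n) = \mathbf{P}_{\{n-L\}}(n) + \mathbf{g}_{\{n-L+1\}}(n)\, \bar{\mathbf{g}}_{\{n-L+1\}}^H(n) \tag{10.8.15}$$

The overall algorithm for the fixed-memory rectangular window adaptive algorithm is given
by (10.8.4) through (10.8.15), which recursively update $\mathbf{c}_{\{n-L\}}(n-1)$ to $\mathbf{c}_{\{n-L+1\}}(n)$. Thus,
this algorithm can adapt to the nonstationary SOE using the local information. The
fixed-length sliding-window RLS algorithm can be implemented by using a combination of two
prewindowed RLS algorithms (Manolakis et al. 1987).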
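The update and downdate steps (10.8.4) through (10.8.15) differ only in a few signs, which the following real-valued sketch exploits. It is an illustration under assumptions not stated in the text: the inverse correlation matrix is started from $\delta^{-1}\mathbf{I}$ with a small $\delta$ (so the result is a lightly regularized LS solution), data before n = 0 are taken as zero, and the names `rank_one_step`, `regressor`, and `sliding_window_rls` are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def rank_one_step(c, P, xv, yv, sign):
    """sign = +1 adds the observation (xv, yv); sign = -1 discards it."""
    M = len(c)
    gbar = [dot(row, xv) for row in P]        # P x
    alpha = 1.0 + sign * dot(gbar, xv)        # 1 + gbar'x, or 1 - gbar'x when discarding
    g = [gi / alpha for gi in gbar]
    e = yv - dot(c, xv)                       # error with the current estimate
    c = [ci + sign * gi * e for ci, gi in zip(c, g)]
    P = [[P[i][j] - sign * g[i] * gbar[j] for j in range(M)] for i in range(M)]
    return c, P

def regressor(x, n, M):
    """Data vector [x(n), ..., x(n-M+1)] with zeros before n = 0."""
    return [x[n - k] if n - k >= 0 else 0.0 for k in range(M)]

def sliding_window_rls(x, y, M, L, delta=1e-3):
    """Estimate based on the last L observations (plus the small initial term)."""
    c = [0.0] * M
    P = [[(1.0 / delta if i == j else 0.0) for j in range(M)] for i in range(M)]
    for n in range(len(x)):
        c, P = rank_one_step(c, P, regressor(x, n, M), y[n], +1)
        if n >= L:                            # window now holds L + 1 points
            c, P = rank_one_step(c, P, regressor(x, n - L, M), y[n - L], -1)
    return c
```

Under these assumptions, after each time step the estimate should agree with the batch solution computed from the last L data pairs and the same initial regularization.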


Evolutionary model—Kalman filter 


In the first two approaches, adaptation in the nonstationary SOE was obtained through
the local information, either by discarding old data or by deemphasizing it. In the third
approach, we assume that we have a statistical model that describes the nonstationarity
of the SOE. This model is in the form of a stochastic difference equation together with
appropriate statistical properties. This leads to the well-known Kalman filter formulation
in which we assume that the parameter variations are modeled by

$$\mathbf{c}(n) = \boldsymbol{\Phi}(n)\, \mathbf{c}(n-1) + \mathbf{v}(n) \tag{10.8.16}$$

where $\mathbf{v}(n)$ is a random vector with zero mean and correlation matrix $\boldsymbol{\Sigma}(n)$, and $\boldsymbol{\Phi}(n)$ is
the state-transition matrix known for all n. The desired signal y(n) is modeled as

$$y(n) = \mathbf{c}^H(n)\, \mathbf{x}(n) + \varepsilon(n) \tag{10.8.17}$$

where $\varepsilon(n)$ is the a posteriori estimation error assumed to be zero-mean with variance $\sigma_\varepsilon^2$.
Thus in this formulation, the parameter vector $\mathbf{c}(n)$ acts as the state of a system while the
input data vector $\mathbf{x}(n)$ acts as the time-varying output vector. Now the best linear unbiased
estimate $\hat{\mathbf{c}}(n)$ of $\mathbf{c}(n)$ based on past observations $\{y(i)\}_{i=0}^{n}$ can be obtained by using the
Kalman filter equations (Section 7.8). These recursive equations are given by


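\begin{align}
\hat{\mathbf{c}}(n) &= \boldsymbol{\Xi}(n)\,\hat{\mathbf{c}}(n-1) + \mathbf{g}(n)\left[y(n) - \hat{\mathbf{c}}^{H}(n-1)\,\boldsymbol{\Xi}^{H}(n)\,\mathbf{x}(n)\right] \tag{10.8.18}\\
\mathbf{g}(n) &= \frac{\boldsymbol{\Xi}(n)\,\mathbf{P}(n-1)\,\mathbf{x}(n)}{\sigma_\varepsilon^{2} + \mathbf{x}^{H}(n)\,\mathbf{P}(n-1)\,\mathbf{x}(n)} \tag{10.8.19}\\
\mathbf{P}(n) &= \boldsymbol{\Xi}(n)\,\mathbf{P}(n-1)\,\boldsymbol{\Xi}^{H}(n) + \boldsymbol{\Sigma}(n) \notag\\
&\quad - \boldsymbol{\Xi}(n)\,\mathbf{P}(n-1)\,\frac{\mathbf{x}(n)\,\mathbf{x}^{H}(n)}{\sigma_\varepsilon^{2} + \mathbf{x}^{H}(n)\,\mathbf{P}(n-1)\,\mathbf{x}(n)}\,\mathbf{P}(n-1)\,\boldsymbol{\Xi}^{H}(n) \tag{10.8.20}
\end{align}

where $\mathbf{g}(n)$ is the Kalman gain matrix and $\mathbf{P}(n)$ is the error covariance matrix. This approach 
implies that if the time-varying parameters are modeled as state equations, then the Kalman 
filter rather than the adaptive filter is a proper solution. 

Furthermore, it can be shown that the Kalman filter has a close similarity to the RLS 
adaptive filters if we make the following appropriate substitutions: 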


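Exponential memory: If we substitute 

$$\boldsymbol{\Xi}(n) = \mathbf{I} \qquad \sigma_\varepsilon^{2} = \lambda \qquad \boldsymbol{\Sigma}(n) = \frac{1-\lambda}{\lambda}\left[\mathbf{I} - \mathbf{g}(n)\,\mathbf{x}^{H}(n)\right]\mathbf{P}(n-1) \tag{10.8.21}$$

then we obtain the exponential memory RLS algorithm given in Table 10.6. 
Rectangularly growing memory: If we substitute 

$$\boldsymbol{\Xi}(n) = \mathbf{I} \qquad \sigma_\varepsilon^{2} = 1 \qquad \boldsymbol{\Sigma}(n) = \mathbf{0} \tag{10.8.22}$$

then we obtain the rectangularly growing memory RLS algorithm. 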
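
The two substitutions can be checked numerically. The sketch below is an illustration under stated assumptions: real-valued data, a two-coefficient filter, and the conventional exponentially weighted RLS recursion, which is assumed here to match Table 10.6 (the table is not reproduced in this section). The function names `kalman_step`, `rls_step`, and `max_gap` are hypothetical.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def matvec(P, x):
    return [dot(row, x) for row in P]

def kalman_step(c, P, x, y, lam):
    """One step of (10.8.18)-(10.8.20) with Xi = I, sigma_eps^2 = lam and
    Sigma(n) = ((1 - lam)/lam) [I - g x^T] P(n-1), i.e. substitution (10.8.21).
    For lam = 1 this reduces to substitution (10.8.22), Sigma(n) = 0."""
    m = len(x)
    Px = matvec(P, x)
    denom = lam + dot(x, Px)
    g = [v / denom for v in Px]                               # (10.8.19)
    e = y - dot(c, x)
    c = [ci + gi * e for ci, gi in zip(c, g)]                 # (10.8.18)
    A = [[P[i][j] - g[i] * Px[j] for j in range(m)] for i in range(m)]  # [I - g x^T] P
    Sigma = [[(1.0 - lam) / lam * a for a in row] for row in A]
    P = [[A[i][j] + Sigma[i][j] for j in range(m)] for i in range(m)]   # (10.8.20)
    return c, P

def rls_step(c, P, x, y, lam):
    """Conventional exponentially weighted RLS step (assumed form of Table 10.6)."""
    m = len(x)
    gbar = [v / lam for v in matvec(P, x)]
    alpha = 1.0 + dot(gbar, x)
    g = [v / alpha for v in gbar]
    e = y - dot(c, x)
    c = [ci + gi * e for ci, gi in zip(c, g)]
    P = [[P[i][j] / lam - g[i] * gbar[j] for j in range(m)] for i in range(m)]
    return c, P

def max_gap(lam, n_steps=200, seed=0):
    """Largest coefficient difference between the two recursions on the same data."""
    rng = random.Random(seed)
    c_true = [-0.8, 0.95]
    ck, Pk = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]
    cr, Pr = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]
    gap = 0.0
    for _ in range(n_steps):
        x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        y = dot(c_true, x) + rng.gauss(0.0, 0.1)
        ck, Pk = kalman_step(ck, Pk, x, y, lam)
        cr, Pr = rls_step(cr, Pr, x, y, lam)
        gap = max(gap, max(abs(a - b) for a, b in zip(ck, cr)))
    return gap

print(max_gap(0.95), max_gap(1.0))
```

Under these assumptions the two recursions are algebraically identical, so any remaining difference is due to floating-point rounding only.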


10.8.2 Preliminaries in Performance Analysis 


In Sections 10.4 and 10.5.4, we developed and analyzed the LMS and RLS algorithms 
in stationary environments, respectively. However, these algorithms are generally used 
in applications (e.g., modems) that are intended to operate continuously in SOE whose 
characteristics change with time. Therefore, we need to discuss the performance of these two 
widely used algorithms in such situations. Although we provided various adaptive filtering 
approaches for time-varying environments above, we now discuss, in the remainder of this 
section, the ability of these two algorithms to track time-varying parameters. We provide 
both analytical results, assuming a model of parameter variation, and experimental results, 
using simulations. 

A popular approach for this analytical assessment is to assume a first-order AR model 
with finite variance [that is, we set $\boldsymbol{\Xi}(n) = \rho\mathbf{I}$ in (10.8.16)]. Although higher-order models 
are also possible, only a few results on the tracking performance using these models are 
currently available. It is ironic that most analytical results on the tracking performance have 
been obtained for the random-walk model (a special case of the first-order AR model), 
which is unrealistic because of the infinite variance. A tutorial review of the latest results 
for the general case and additional references are available in Macchi (1996). 

In our analysis of tracking characteristics of the LMS and RLS algorithms, we use 
the first-order AR model and discuss its effect on the tracking performance. The closed- 
form results will be given using the random-walk model and confirmed using simulated 
experiments. 


Analysis setup 
In the tracking analysis, it is desirable to use the a priori adaptive filter. Hence we 
assume that the desired response is generated by the following filter model† 

$$y(n) = \mathbf{c}_o^{H}(n-1)\,\mathbf{x}(n) + v(n) \tag{10.8.23}$$

where $v(n)$ is assumed to be WGN$(0, \sigma_v^2)$ with $\sigma_v^2 < \infty$. The random processes $\mathbf{x}(n)$ and 
$v(n)$ are assumed to be independent and stationary. The variation of $\mathbf{c}_o(n)$ is modeled by 
the first-order AR (or Markov) process 

$$\mathbf{c}_o(n) = \rho\,\mathbf{c}_o(n-1) + \boldsymbol{\psi}(n) \tag{10.8.24}$$

with $0 < \rho < 1$, which creates the nonstationarity of the SOE. The quantity $\boldsymbol{\psi}(n)$ is the 
uncertainty in the model and is assumed to be independent of $\mathbf{x}(n)$ and $v(n)$, with mean 
$E\{\boldsymbol{\psi}(n)\} = \mathbf{0}$ and correlation $E\{\boldsymbol{\psi}(n)\boldsymbol{\psi}^{H}(n)\} = \mathbf{R}_\psi$. Tracking is generally achievable if 
$\rho$ is close to 1. The random-walk model is obtained by using $\rho = 1$ in (10.8.24). 

Conjugate transposing and premultiplying both sides of (10.8.23) by $\mathbf{x}(n)$, taking the 
expectation, and using independence between $\mathbf{x}(n)$ and $v(n)$, we obtain 

$$\mathbf{R}\,\mathbf{c}_o(n-1) = \mathbf{d}(n) \tag{10.8.25}$$

Hence, $\mathbf{c}_o(n-1)$ is the optimum a priori filter and 

$$e_o(n) = y(n) - \mathbf{c}_o^{H}(n-1)\,\mathbf{x}(n) = v(n) \tag{10.8.26}$$

is the optimum a priori error. If $\mathbf{R}_\psi = \mathbf{0}$ and $\rho = 1$, we have $\mathbf{c}_o(n) = \mathbf{c}_o$ for all $n$, and 
therefore $y(n)$ is wide-sense stationary (WSS). In this case, we have a stationary environment, 
and the goal of the adaptive filter is to find the optimum filter $\mathbf{c}_o$. For $\mathbf{R}_\psi \neq \mathbf{0}$, the 
adaptive filter should find and track the optimum a priori filter $\mathbf{c}_o(n)$. This setup, which is 
widely used to analyze the properties of adaptive algorithms, is illustrated in Figure 10.42. 


Assumptions 


To analyze the tracking performance of adaptive algorithms, we use the assumptions 
discussed elsewhere and repeated below for convenience. 


A1 The sequence of input data vectors $\mathbf{x}(n)$ is WGN$(\mathbf{0}, \mathbf{R})$. 
A2 The desired response $y(n)$ can be modeled as 

$$y(n) = \mathbf{c}_o^{H}(n-1)\,\mathbf{x}(n) + e_o(n) \tag{10.8.27}$$

   where $e_o(n)$ is WGN$(0, \sigma_v^2)$. 
A3 The time variation of $\mathbf{c}_o(n)$ is described by 

$$\mathbf{c}_o(n) = \rho\,\mathbf{c}_o(n-1) + \boldsymbol{\psi}(n) \tag{10.8.28}$$

   where $0 < \rho < 1$ and $\boldsymbol{\psi}(n)$ is WGN$(\mathbf{0}, \mathbf{R}_\psi)$. 
A4 The random sequences $\mathbf{x}(n)$, $e_o(n)$, and $\boldsymbol{\psi}(n)$ are mutually independent. 


Through these assumptions, we want to stress that the nonstationarity of the SOE is created 
solely by $\mathbf{c}_o(n)$ and not by $\mathbf{x}(n)$, which is WSS. 


†We use this model to make a fair comparison between the adaptive and the optimum filter. 


FIGURE 10.42 
Block diagram of the setup and model used for the analysis of adaptive algorithms. 


Although we provide analysis for (10.8.27), many results are given for the random-walk 
model ($\rho = 1$). The case $0 < \rho < 1$, which is straightforward but complicated, is discussed 
in Solo and Kong (1995). Before we delve into this analysis, we discuss criteria that are 
used for evaluating the tracking performance. 

Degree of nonstationarity 


To determine whether an adaptive algorithm can adequately track the changing SOE, 
one needs to define the speed of variation of the statistics of the adaptive filter environment. 
This speed is quantified in terms of the degree of nonstationarity (DNS), introduced in 
Macchi (1995, 1996), and is defined by 

$$\eta(n) \triangleq \sqrt{\frac{E\{|y_{o,\mathrm{incr}}(n)|^{2}\}}{P_o(n)}} \tag{10.8.29}$$

where 

$$y_{o,\mathrm{incr}}(n) = [\mathbf{c}_o(n) - \mathbf{c}_o(n-1)]^{H}\,\mathbf{x}(n) \tag{10.8.30}$$

is the output of the incremental filter. The numerator is the power introduced by the variation 
of the optimum filter, and the denominator is the MMSE, which in the context of (10.8.26) 
is equal to the power of the output noise. Assuming $\rho = 1$ in (10.8.28), we see that (10.8.30) 
is given by 

$$y_{o,\mathrm{incr}}(n) = \boldsymbol{\psi}^{H}(n)\,\mathbf{x}(n)$$

and hence the numerator in (10.8.29) is given by 

\begin{align}
E\{|y_{o,\mathrm{incr}}(n)|^{2}\} &= E\{\boldsymbol{\psi}^{H}(n)\mathbf{x}(n)\mathbf{x}^{H}(n)\boldsymbol{\psi}(n)\} = \mathrm{tr}\left[E\{\boldsymbol{\psi}^{H}(n)\mathbf{x}(n)\mathbf{x}^{H}(n)\boldsymbol{\psi}(n)\}\right] \notag\\
&= \mathrm{tr}\left[E\{\boldsymbol{\psi}(n)\boldsymbol{\psi}^{H}(n)\mathbf{x}(n)\mathbf{x}^{H}(n)\}\right] = \mathrm{tr}\left[E\{\boldsymbol{\psi}(n)\boldsymbol{\psi}^{H}(n)\}\,E\{\mathbf{x}(n)\mathbf{x}^{H}(n)\}\right] \tag{10.8.31}\\
&= \mathrm{tr}[\mathbf{R}_\psi\mathbf{R}] = \mathrm{tr}[\mathbf{R}\mathbf{R}_\psi] \notag
\end{align}

where we have used the independence assumption A4. Substituting (10.8.31) in (10.8.29), 
we obtain 

$$\eta(n) = \sqrt{\frac{\mathrm{tr}(\mathbf{R}\mathbf{R}_\psi)}{\sigma_v^{2}}} \tag{10.8.32}$$

Smaller values of $\eta$ ($\ll 1$) imply that the adaptive algorithm can track time variations of 
the nonstationary SOE. On the contrary, if $\eta > 1$, then the statistical variations of the 
SOE are too fast for the adaptive algorithm to keep up with the SOE and lead to massive 
misadjustment errors. In such situations, an adaptive filter should not be used. 
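
As a small numerical illustration of (10.8.32), the sketch below evaluates the DNS for real-valued matrices stored as nested lists. The values R = I, R_psi = (0.01)^2 I, and sigma_v = 0.1 are illustrative choices (they coincide with the simulation parameters used later in this section), and the function name is hypothetical.

```python
import math

def degree_of_nonstationarity(R, R_psi, sigma_v):
    """Evaluate eta = sqrt(tr(R R_psi)) / sigma_v, as in (10.8.32)."""
    m = len(R)
    trace = sum(R[i][k] * R_psi[k][i] for i in range(m) for k in range(m))
    return math.sqrt(trace) / sigma_v

R = [[1.0, 0.0], [0.0, 1.0]]
R_psi = [[1e-4, 0.0], [0.0, 1e-4]]
eta = degree_of_nonstationarity(R, R_psi, sigma_v=0.1)
print(round(eta, 4))
```

For these values the result is well below 1, so the environment counts as slowly varying in the sense described above.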


Mean square deviation (MSD) 

We defined the MSD $\mathcal{D}(n)$ in (10.2.29) as a performance measure for adaptive filters 
in the steady-state environment. It is also used for measuring the tracking performance. 
Consider the coefficient error vector $\tilde{\mathbf{c}}(n)$, which can be written as 

\begin{align}
\tilde{\mathbf{c}}(n) &= \mathbf{c}(n) - \mathbf{c}_o(n) \notag\\
&= [\mathbf{c}(n) - E\{\mathbf{c}(n)\}] + [E\{\mathbf{c}(n)\} - \mathbf{c}_o(n)] \tag{10.8.33}\\
&\triangleq \tilde{\mathbf{c}}_1(n) + \tilde{\mathbf{c}}_2(n) \tag{10.8.34}
\end{align}

where $\tilde{\mathbf{c}}_1(n)$ is the fluctuation of the adaptive filter parameter vector about its mean (estimation 
error) and $\tilde{\mathbf{c}}_2(n)$ is the bias of $\mathbf{c}(n)$ with respect to the true vector $\mathbf{c}_o(n)$ (systematic or 
lag error). Using the independence assumption of the previous section that $\mathbf{x}(n)$ and $\mathbf{c}(n-1)$ 
are statistically independent, we can show that (Macchi 1996) 

$$E\{\tilde{\mathbf{c}}_1^{H}(n)\,\tilde{\mathbf{c}}_2(n)\} = 0 \tag{10.8.35}$$

which by using (10.2.29) and (10.8.34) leads to 

$$\mathcal{D}(n) = \mathcal{D}_1(n) + \mathcal{D}_2(n) \tag{10.8.36}$$

The first MSD term is due to the parameter estimation error and is called the estimation 
variance. The second MSD term is due to the parameter lag error and is termed the lag variance, 
and its presence indicates the nonstationary environment. 

Misadjustment and lowest excess MSE 


The second performance measure, defined in (10.2.38), is the (a priori) misadjustment 
$\mathcal{M}(n)$, which is the ratio of the excess MSE $P_{\mathrm{ex}}(n)$ to the MMSE $P_o(n)$. The a priori excess 
MSE is given by 

$$P_{\mathrm{ex}}(n) = E\{|\tilde{\mathbf{c}}^{H}(n-1)\mathbf{x}(n)|^{2}\} = E\{|\tilde{\mathbf{c}}_1^{H}(n-1)\mathbf{x}(n) + \tilde{\mathbf{c}}_2^{H}(n-1)\mathbf{x}(n)|^{2}\} \tag{10.8.37}$$

which under the independence assumption and (10.8.35) can be written as 

$$P_{\mathrm{ex}}(n) = P_{\mathrm{ex},1}(n) + P_{\mathrm{ex},2}(n) \tag{10.8.38}$$

where the first term, $P_{\mathrm{ex},1}(n)$, is the excess MSE due to estimation error and is termed the 
estimation noise while the second term, $P_{\mathrm{ex},2}(n)$, is the excess MSE due to lag error and is 
called the lag noise. Therefore, we can also write the misadjustment $\mathcal{M}(n)$ as 

$$\mathcal{M}(n) = \mathcal{M}_1(n) + \mathcal{M}_2(n) \tag{10.8.39}$$

where $\mathcal{M}_1(n)$ is the estimation misadjustment and $\mathcal{M}_2(n)$ is the lag misadjustment. 

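In the context of the first-order Markov model, the best performance obtained by any 
a priori adaptive filter occurs if $\mathbf{c}(n) = \rho\,\mathbf{c}_o(n-1)$. This observation makes possible the 
computation of a lower bound for the excess MSE of any a priori adaptive algorithm. From 
(10.8.34) and (10.8.24), we have 

$$\tilde{\mathbf{c}}(n) = \mathbf{c}(n) - \mathbf{c}_o(n) = [\mathbf{c}(n) - \rho\,\mathbf{c}_o(n-1)] - \boldsymbol{\psi}(n) \triangleq \bar{\mathbf{c}}(n) - \boldsymbol{\psi}(n) \tag{10.8.40}$$

and hence 

\begin{align}
P_{\mathrm{ex}}(n) &= E\{|\tilde{\mathbf{c}}^{H}(n-1)\mathbf{x}(n)|^{2}\} \notag\\
&= E\{|\bar{\mathbf{c}}^{H}(n-1)\mathbf{x}(n) - \boldsymbol{\psi}^{H}(n-1)\mathbf{x}(n)|^{2}\} \tag{10.8.41}\\
&= E\{|\bar{\mathbf{c}}^{H}(n-1)\mathbf{x}(n)|^{2}\} + E\{|\boldsymbol{\psi}^{H}(n-1)\mathbf{x}(n)|^{2}\} \notag\\
&\quad - 2\,\mathrm{Re}\,E\{\bar{\mathbf{c}}^{H}(n-1)\mathbf{x}(n)\mathbf{x}^{H}(n)\boldsymbol{\psi}(n-1)\} \tag{10.8.42}
\end{align}

Since the term $\bar{\mathbf{c}}(n)$ does not depend on $\boldsymbol{\psi}(n)$ and since the random sequences $\mathbf{x}(n)$ and 
$\boldsymbol{\psi}(n-1)$ are assumed independent, the last term in (10.8.42) is zero. Hence, 

$$P_{\mathrm{ex}}(n) \geq E\{|\boldsymbol{\psi}^{H}(n-1)\mathbf{x}(n)|^{2}\} \tag{10.8.43}$$

which provides a lower bound for the excess MSE of any a priori adaptation algorithm. 
Because $\boldsymbol{\psi}(n)$ and $\mathbf{x}(n)$ are assumed independent, we obtain 

$$E\{|\boldsymbol{\psi}^{H}(n-1)\mathbf{x}(n)|^{2}\} = \mathrm{tr}(\mathbf{R}\mathbf{R}_\psi) \tag{10.8.44}$$

Similarly, neglecting the dependence between $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$, we have 

$$E\{|\tilde{\mathbf{c}}^{H}(n-1)\mathbf{x}(n)|^{2}\} = \mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] \tag{10.8.45}$$

which provides the a priori excess MSE. Furthermore, it can be shown that the DNS places 
a lower limit on the misadjustment, that is, 

$$\mathcal{M}(n) = \frac{P_{\mathrm{ex}}(n)}{P_o(n)} \geq \frac{E\{|\boldsymbol{\psi}^{H}(n-1)\mathbf{x}(n)|^{2}\}}{P_o(n)} = \frac{\mathrm{tr}(\mathbf{R}\mathbf{R}_\psi)}{\sigma_v^{2}} = \eta^{2}(n) \tag{10.8.46}$$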
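
As a quick numerical illustration of the bound in (10.8.46), assume the illustrative values $\mathbf{R} = \mathbf{I}$, $\mathbf{R}_\psi = (0.01)^2\mathbf{I}$, $M = 2$, and $\sigma_v = 0.1$ (these coincide with the simulation parameters used in the examples of this section). The arithmetic is then:

```latex
\begin{align*}
\mathrm{tr}(\mathbf{R}\mathbf{R}_\psi) &= 2 \times (0.01)^{2} = 2 \times 10^{-4} \\
\mathcal{M}(n) &\geq \frac{\mathrm{tr}(\mathbf{R}\mathbf{R}_\psi)}{\sigma_v^{2}}
  = \frac{2 \times 10^{-4}}{10^{-2}} = 0.02 = \eta^{2}(n),
  \qquad \eta(n) = \sqrt{0.02} \approx 0.1414
\end{align*}
```

Under these assumed values, no a priori adaptive algorithm can achieve a misadjustment below about 2 percent.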


10.8.3 LMS Algorithm 


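Using the LMS algorithm (10.4.12), the error vector in (10.8.34), and the Markov model in 
(10.8.28) with $\rho = 1$, we can easily obtain 

$$\tilde{\mathbf{c}}(n) = [\mathbf{I} - 2\mu\,\mathbf{x}(n)\mathbf{x}^{H}(n)]\,\tilde{\mathbf{c}}(n-1) + 2\mu\,\mathbf{x}(n)\,e_o^{*}(n) - \boldsymbol{\psi}(n) \tag{10.8.47}$$

which, compared to (10.4.15), has one extra input. Since $\mathbf{x}(n)$, $e_o(n)$, and $\boldsymbol{\psi}(n)$ are mutually 
independent, $\boldsymbol{\psi}(n)$ adds only an extra term $\sigma_\psi^{2}\mathbf{I}$ to the correlation of $\tilde{\mathbf{c}}(n)$. 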


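Misadjustment. To determine the misadjustment, we perform orthogonal transformation 
of the correlation matrix of $\tilde{\mathbf{c}}(n)$. When we transform (10.4.28) to (10.4.30), using the 
orthogonal transformation (10.4.29), the presence of the diagonal matrix $\sigma_\psi^{2}\mathbf{I}$ changes only 
the diagonal components with the addition of the term $\sigma_\psi^{2}$. Indeed, we can easily show that 

$$\theta_k(n) = \rho_k\,\theta_k(n-1) + 4\mu^{2}\lambda_k P_{\mathrm{ex}}(n-1) + 4\mu^{2}P_o\lambda_k + \sigma_\psi^{2} \tag{10.8.48}$$

where $P_o(n) = P_o = \sigma_v^{2}$ for large $n$. Clearly, (10.8.48) converges under the same conditions 
as (10.4.40). At steady state we have 

$$\theta_k(\infty) = \rho_k\,\theta_k(\infty) + 4\mu^{2}\lambda_k P_{\mathrm{ex}}(\infty) + 4\mu^{2}P_o\lambda_k + \sigma_\psi^{2} \tag{10.8.49}$$

or using (10.4.36), we have 

$$\theta_k(\infty) = \mu\,\frac{P_o + P_{\mathrm{ex}}(\infty)}{1 - 2\mu\lambda_k} + \frac{1}{4\mu}\,\frac{\sigma_\psi^{2}}{\lambda_k(1 - 2\mu\lambda_k)} \tag{10.8.50}$$

which in conjunction with (10.4.55) and (10.4.56) gives 

$$P_{\mathrm{ex}}(\infty) = \frac{C(\mu)}{1 - C(\mu)}\,\sigma_v^{2} + \frac{1}{4\mu}\,\frac{D(\mu)}{1 - C(\mu)}\,\sigma_\psi^{2} \tag{10.8.51}$$

where 

$$D(\mu) \triangleq \sum_{k=1}^{M}\frac{1}{1 - 2\mu\lambda_k} \tag{10.8.52}$$

If $\mu\lambda_k \ll 1$, we have $C(\mu) \approx \mu\,\mathrm{tr}(\mathbf{R})$ and $D(\mu) \approx M$, which lead to 

$$P_{\mathrm{ex}}(\infty) \approx \mu\,\sigma_v^{2}\,\mathrm{tr}(\mathbf{R}) + \frac{1}{4\mu}\,M\sigma_\psi^{2} \tag{10.8.53}$$

or 

$$\mathcal{M}(\infty) \approx \mu\,\mathrm{tr}(\mathbf{R}) + \frac{M}{4\mu}\,\frac{\sigma_\psi^{2}}{\sigma_v^{2}} \tag{10.8.54}$$

Hence in the steady state, the misadjustment can be approximated by two terms. The first 
term is the estimation misadjustment, which increases with $\mu$, while the second term is the 
lag misadjustment, which decreases with $\mu$. Therefore, an optimum value of $\mu$ exists that 
minimizes $\mathcal{M}(\infty)$, given by 

$$\mu_{\mathrm{opt}} \approx \frac{\sigma_\psi}{2\sigma_v}\sqrt{\frac{M}{\mathrm{tr}(\mathbf{R})}} \tag{10.8.55}$$

$$\mathcal{M}_{\min}(\infty) \approx \frac{\sigma_\psi}{\sigma_v}\sqrt{M\,\mathrm{tr}(\mathbf{R})} \tag{10.8.56}$$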
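
The optimum step size in (10.8.55) follows from a one-line minimization of the approximation (10.8.54). The steps below are a sketch written out for convenience, using only symbols already defined in this section:

```latex
\begin{align*}
\frac{d\mathcal{M}(\infty)}{d\mu} &\approx \mathrm{tr}(\mathbf{R}) - \frac{M}{4\mu^{2}}\,\frac{\sigma_\psi^{2}}{\sigma_v^{2}} = 0
  \;\Longrightarrow\; \mu_{\mathrm{opt}} \approx \frac{\sigma_\psi}{2\sigma_v}\sqrt{\frac{M}{\mathrm{tr}(\mathbf{R})}} \\
\mathcal{M}_{\min}(\infty) &\approx \mu_{\mathrm{opt}}\,\mathrm{tr}(\mathbf{R}) + \frac{M}{4\mu_{\mathrm{opt}}}\,\frac{\sigma_\psi^{2}}{\sigma_v^{2}}
  = \frac{\sigma_\psi}{2\sigma_v}\sqrt{M\,\mathrm{tr}(\mathbf{R})} + \frac{\sigma_\psi}{2\sigma_v}\sqrt{M\,\mathrm{tr}(\mathbf{R})}
  = \frac{\sigma_\psi}{\sigma_v}\sqrt{M\,\mathrm{tr}(\mathbf{R})}
\end{align*}
```

At the optimum the estimation and lag contributions are equal, each accounting for half of the minimum misadjustment.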


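MSD. To determine the MSD, consider (10.8.47). For small step size $\mu$, the system 
matrix $[\mathbf{I} - 2\mu\,\mathbf{x}(n)\mathbf{x}^{H}(n)]$ is very close to the identity matrix. Hence using the direct 
averaging method due to Kushner (1984), we can obtain a close solution of $\tilde{\mathbf{c}}(n)$ by solving 
(10.8.47) in which the system matrix is replaced by its average $[\mathbf{I} - 2\mu\mathbf{R}]$, that is, 

$$\tilde{\mathbf{c}}(n) = [\mathbf{I} - 2\mu\mathbf{R}]\,\tilde{\mathbf{c}}(n-1) + 2\mu\,\mathbf{x}(n)\,e_o^{*}(n) - \boldsymbol{\psi}(n) \tag{10.8.57}$$

where we have kept the same notation. Taking the covariance of both sides of (10.8.57), we 
obtain 

$$\boldsymbol{\Phi}(n) = [\mathbf{I} - 2\mu\mathbf{R}]\,\boldsymbol{\Phi}(n-1)\,[\mathbf{I} - 2\mu\mathbf{R}] + 4\mu^{2}\sigma_v^{2}\mathbf{R} + \mathbf{R}_\psi \tag{10.8.58}$$

The approximate steady-state solution of (10.8.58) is given by 

$$\mathbf{R}\boldsymbol{\Phi} + \boldsymbol{\Phi}\mathbf{R} \approx 2\mu\,\sigma_v^{2}\mathbf{R} + \frac{\mathbf{R}_\psi}{2\mu} \tag{10.8.59}$$

where the second-order term $4\mu^{2}\mathbf{R}\boldsymbol{\Phi}\mathbf{R}$ is ignored for small values of $\mu$. After premultiplying 
(10.8.59) by $\mathbf{R}^{-1}$, we obtain 

$$\boldsymbol{\Phi} + \mathbf{R}^{-1}\boldsymbol{\Phi}\mathbf{R} \approx 2\mu\,\sigma_v^{2}\mathbf{I} + \frac{\mathbf{R}^{-1}\mathbf{R}_\psi}{2\mu} \tag{10.8.60}$$

Taking the trace of (10.8.60) and using $\mathrm{tr}(\mathbf{R}^{-1}\boldsymbol{\Phi}\mathbf{R}) = \mathrm{tr}(\boldsymbol{\Phi})$, we obtain 

$$\mathrm{tr}(\boldsymbol{\Phi}) \approx \mu M\sigma_v^{2} + \frac{\mathrm{tr}(\mathbf{R}^{-1}\mathbf{R}_\psi)}{4\mu} \tag{10.8.61}$$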
By following the development in (10.8.28), it can be shown that (Problem 10.52) $\mathcal{D}(\infty) = 
\mathrm{tr}(\boldsymbol{\Phi})$. Hence 

$$\mathcal{D}(\infty) \approx \mu M\sigma_v^{2} + \frac{\mathrm{tr}(\mathbf{R}^{-1}\mathbf{R}_\psi)}{4\mu} \tag{10.8.62}$$

As expected, the MSD has two terms: The estimation deviation is linearly proportional to 
$\mu$ while the lag deviation is inversely proportional to $\mu$. The optimum value of the step size 
$\mu$ is obtained when both deviations are equal and is given by 

$$\mu_{\mathrm{opt}} \approx \frac{1}{2}\sqrt{\frac{\mathrm{tr}(\mathbf{R}^{-1}\mathbf{R}_\psi)}{M\sigma_v^{2}}} \tag{10.8.63}$$

and 

$$\mathcal{D}_{\min}(\infty) \approx \sqrt{M\sigma_v^{2}\,\mathrm{tr}(\mathbf{R}^{-1}\mathbf{R}_\psi)} \tag{10.8.64}$$


EXAMPLE 10.8.1. To study the tracking performance of the LMS algorithm, we will simulate 
a slowly time-varying SOE whose parameters follow an almost random-walk behavior. The 
simulation setup is shown in Figure 10.42 and given by (10.8.27) and (10.8.28). The simulation 
parameters are as follows: 

$\mathbf{c}_o(n)$ model parameters: $\mathbf{c}_o(0) = [-0.8 \;\; 0.95]^{T}$   $M = 2$   $\rho = 0.999$ 
                               $\boldsymbol{\psi}(n) \sim$ WGN$(\mathbf{0}, \mathbf{R}_\psi)$   $\mathbf{R}_\psi = (0.01)^{2}\mathbf{I}$ 
Signal $\mathbf{x}(n)$ parameters: $\mathbf{x}(n) \sim$ WGN$(\mathbf{0}, \mathbf{R})$   $\mathbf{R} = \mathbf{I}$ 
Noise $v(n)$ parameters: $v(n) \sim$ WGN$(0, \sigma_v^{2})$   $\sigma_v = 0.1$ 

For these values, the degree of nonstationarity from (10.8.32) is given by 

$$\eta(n) = \frac{\sqrt{\mathrm{tr}(\mathbf{R}\mathbf{R}_\psi)}}{\sigma_v} = 0.1414$$

which means that the LMS can track the time variations of the SOE. 

Three different adaptations (slow, matched, and fast) of the LMS algorithm were designed. 
Their adaptation results are shown in Figures 10.43 through 10.48. From (10.8.55) and (10.8.63), 
the optimum performance is obtained when 

$$\mu_{\mathrm{opt}} = 0.05$$

for which $\mathcal{M}_{\min}(\infty) = 0.2$ and $\mathcal{D}_{\min}(\infty) = 0.002$. Hence, the following values for $\mu$ were 
selected for simulation: 

Slow:    $\mu = 0.01$ 
Matched: $\mu = 0.1$ 
Fast:    $\mu = 0.3$ 

Figure 10.43 shows the matched adaptation of parameter coefficients while Figure 10.44 shows 
the resulting D(n) and M(n). Clearly, the LMS tracks the varying coefficients nicely with 
expected small misregistration and deviation errors. Figure 10.45 shows the slow adaptation of 
parameter coefficients while Figure 10.46 shows the resulting D(n) and M(n). In this case, 
although the LMS algorithm tracks with bounded error variance, the tracking is not very good 
and the resulting misregistration errors are large. Finally, Figure 10.47 shows the fast adaptation 
of parameter coefficients while Figure 10.48 shows the resulting D(n) and M(n). In this case, 
although the algorithm is able to keep track of the slowly varying coefficients, the resulting 
variance is large and hence the estimation errors are large. Once again, the total errors are large 
compared to those for the matched case. 
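
The trade-off between lag and estimation errors can be reproduced with a short simulation. The sketch below is not the code behind Figures 10.43 through 10.48; it is a minimal illustration that assumes real-valued data, an exact random-walk model (rho = 1 instead of 0.999), and time averaging over one long run in place of ensemble averaging. It compares only the slow step size with the value of mu_opt from (10.8.63). The function names are hypothetical.

```python
import math
import random

def lms_tracking_msd(mu, n_steps=30000, burn_in=5000, seed=1,
                     sigma_v=0.1, sigma_psi=0.01):
    """Time-averaged steady-state MSD of LMS tracking a 2-tap random-walk filter."""
    rng = random.Random(seed)
    c_o = [-0.8, 0.95]            # optimum filter, started at c_o(0)
    c = [0.0, 0.0]                # adaptive filter
    acc = 0.0
    for n in range(n_steps):
        x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]            # x(n) ~ WGN(0, I)
        y = c_o[0] * x[0] + c_o[1] * x[1] + rng.gauss(0.0, sigma_v)  # uses c_o(n-1)
        e = y - (c[0] * x[0] + c[1] * x[1])                       # a priori error
        c = [c[k] + 2.0 * mu * e * x[k] for k in range(2)]        # LMS update
        c_o = [c_o[k] + rng.gauss(0.0, sigma_psi) for k in range(2)]  # random walk
        if n >= burn_in:
            acc += (c[0] - c_o[0]) ** 2 + (c[1] - c_o[1]) ** 2
    return acc / (n_steps - burn_in)

def lms_msd_theory(mu, M=2, sigma_v=0.1, tr_Rinv_Rpsi=2e-4):
    """Small-step-size approximation (10.8.62)."""
    return mu * M * sigma_v ** 2 + tr_Rinv_Rpsi / (4.0 * mu)

mu_opt = 0.5 * math.sqrt(2e-4 / (2 * 0.1 ** 2))    # (10.8.63)
for mu in (0.01, mu_opt):
    print(mu, lms_tracking_msd(mu), lms_msd_theory(mu))
```

Because (10.8.62) is a small-step-size approximation, the simulated values should be expected to agree with it only roughly, but the ordering of the two cases is the point of the comparison.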


10.8.4 RLS Algorithm with Exponential Forgetting 
Consider again the model given in Figure 10.42 and described in the analysis setup. 


Misadjustment. To determine the misadjustment in tracking, we first evaluate the excess 
MSE caused by lag, that is, by the deviation between $E\{\mathbf{c}(n)\}$ and the optimum a priori 
filter $\mathbf{c}_o(n)$. Combining 

$$\mathbf{c}(n) = \mathbf{c}(n-1) + \hat{\mathbf{R}}^{-1}(n)\,\mathbf{x}(n)\,e^{*}(n) \tag{10.8.65}$$

with 

$$e^{*}(n) = e_o^{*}(n) - \mathbf{x}^{H}(n)\,[\mathbf{c}(n-1) - \mathbf{c}_o(n-1)] \tag{10.8.66}$$

and taking the expectation results in 

$$E\{\mathbf{c}(n)\} = E\{\mathbf{c}(n-1)\} - E\{\hat{\mathbf{R}}^{-1}(n)\mathbf{x}(n)\mathbf{x}^{H}(n)\}\,[E\{\mathbf{c}(n-1)\} - \mathbf{c}_o(n-1)] \tag{10.8.67}$$

because the expectation of $\hat{\mathbf{R}}^{-1}(n)\mathbf{x}(n)e_o^{*}(n)$ vanishes. Using the approximation 
$E\{\hat{\mathbf{R}}^{-1}(n)\mathbf{x}(n)\mathbf{x}^{H}(n)\} \approx (1-\lambda)\mathbf{I}$ and writing $\tilde{\mathbf{c}}_{\mathrm{lag}}(n) \triangleq E\{\mathbf{c}(n)\} - \mathbf{c}_o(n)$ for the lag error, we have 

$$\tilde{\mathbf{c}}_{\mathrm{lag}}(n) \approx \lambda\,\tilde{\mathbf{c}}_{\mathrm{lag}}(n-1) + \mathbf{c}_o(n-1) - \mathbf{c}_o(n) \tag{10.8.68}$$

or 

$$\tilde{\mathbf{c}}_{\mathrm{lag}}(n) \approx \lambda\,\tilde{\mathbf{c}}_{\mathrm{lag}}(n-1) - \boldsymbol{\psi}(n) \tag{10.8.69}$$

for the random-walk ($\rho = 1$) model. The covariance matrix is 

$$\boldsymbol{\Phi}_{\mathrm{lag}}(n) \approx \lambda^{2}\,\boldsymbol{\Phi}_{\mathrm{lag}}(n-1) + \mathbf{R}_\psi \tag{10.8.70}$$

FIGURE 10.43 
Matched adaptation of slowly time-varying parameters: LMS algorithm with $\mu = 0.1$. 

FIGURE 10.44 
Learning curves of LMS algorithm with matched adaptation. 

FIGURE 10.45 
Slow adaptation of slowly time-varying parameters: LMS algorithm with $\mu = 0.01$. 

FIGURE 10.46 
Learning curves of LMS algorithm for slow adaptation. 

FIGURE 10.47 
Fast adaptation of slowly time-varying parameters: LMS algorithm with $\mu = 0.3$. 

FIGURE 10.48 
Learning curves of LMS algorithm for fast adaptation. 


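and in steady state (assuming $0 < \lambda < 1$) 

$$\boldsymbol{\Phi}_{\mathrm{lag}}(\infty) \approx \frac{1}{1-\lambda^{2}}\,\mathbf{R}_\psi \tag{10.8.71}$$

The lag excess MSE is 

$$P_{\mathrm{lag}}(\infty) = \mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}_{\mathrm{lag}}(\infty)] \approx \frac{1}{1-\lambda^{2}}\,\mathrm{tr}[\mathbf{R}\mathbf{R}_\psi] \approx \frac{1}{2(1-\lambda)}\,\mathrm{tr}[\mathbf{R}\mathbf{R}_\psi] \tag{10.8.72}$$

because $1 - \lambda^{2} = (1+\lambda)(1-\lambda) \approx 2(1-\lambda)$ for $\lambda \approx 1$. 
The excess MSE due to estimation is $[(1-\lambda)/2]\,M\sigma_v^{2}$, hence the total excess MSE is 

$$P_{\mathrm{ex}}(\infty) \approx \frac{1-\lambda}{2}\,M\sigma_v^{2} + \frac{1}{2(1-\lambda)}\,\sigma_\psi^{2}\,\mathrm{tr}(\mathbf{R}) \tag{10.8.73}$$

if $\mathbf{R}_\psi = \sigma_\psi^{2}\mathbf{I}$. Finally, the misadjustment is given by 

$$\mathcal{M}(\infty) \approx \frac{1-\lambda}{2}\,M + \frac{\sigma_\psi^{2}\,\mathrm{tr}(\mathbf{R})}{2(1-\lambda)\,\sigma_v^{2}} \tag{10.8.74}$$

The first term in (10.8.74) is the estimation misadjustment, which is linearly proportional 
to $1 - \lambda$, while the second term is the lag misadjustment, which is inversely proportional 
to $1 - \lambda$. The optimum value of $\lambda$ is given by 

$$\lambda_{\mathrm{opt}} \approx 1 - \frac{\sigma_\psi}{\sigma_v}\sqrt{\frac{1}{M}\,\mathrm{tr}(\mathbf{R})} \tag{10.8.75}$$

and the minimum misadjustment is given by 

$$\mathcal{M}_{\min}(\infty) \approx \frac{\sigma_\psi}{\sigma_v}\sqrt{M\,\mathrm{tr}(\mathbf{R})} \tag{10.8.76}$$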
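
The closed-form expressions (10.8.74) through (10.8.76) are easy to evaluate numerically. The sketch below does so for illustrative values (M = 2, tr(R) = 2, sigma_psi = 0.01, sigma_v = 0.1, which coincide with the simulation parameters of this section) and assumes R_psi = sigma_psi^2 I, as required by (10.8.73). The function names are hypothetical.

```python
import math

def rls_misadjustment(lam, M, tr_R, sigma_psi, sigma_v):
    """Steady-state misadjustment approximation (10.8.74)."""
    estimation = 0.5 * (1.0 - lam) * M
    lag = sigma_psi ** 2 * tr_R / (2.0 * (1.0 - lam) * sigma_v ** 2)
    return estimation + lag

def rls_optimum(M, tr_R, sigma_psi, sigma_v):
    """Return (lambda_opt, M_min) from (10.8.75) and (10.8.76)."""
    ratio = sigma_psi / sigma_v
    return 1.0 - ratio * math.sqrt(tr_R / M), ratio * math.sqrt(M * tr_R)

lam_opt, m_min = rls_optimum(M=2, tr_R=2.0, sigma_psi=0.01, sigma_v=0.1)
for lam in (0.99, lam_opt, 0.5):
    print(lam, rls_misadjustment(lam, 2, 2.0, 0.01, 0.1))
```

Since the derivation assumes a forgetting factor close to 1, the value printed for 0.5 should be read as an extrapolation of the approximation rather than as a prediction.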


MSD. An analysis similar to the MSD development of the LMS algorithm can be done 
to obtain 


2; 
D(oo) ~ a A eR (10.8.77) 
aay SP Bish) = 
° * Oy 1 
with ht aaa (10.8.78) 
Oyo, = 
and Dmin(0o) ~ — V/tr(R-H (10.8.79) 


which again highlights the dependence of tracking abilities on $\lambda$. 


EXAMPLE 10.8.2. To study the tracking performance of the RLS algorithm, we again simulate 
the slowly time-varying SOE given in Example 10.8.1 whose parameters are repeated here: 

$\mathbf{c}_o(n)$ model parameters: $\mathbf{c}_o(0) = [-0.8 \;\; 0.95]^{T}$   $M = 2$   $\rho = 0.999$ 
                               $\boldsymbol{\psi}(n) \sim$ WGN$(\mathbf{0}, \mathbf{R}_\psi)$   $\mathbf{R}_\psi = (0.01)^{2}\mathbf{I}$ 
Signal $\mathbf{x}(n)$ parameters: $\mathbf{x}(n) \sim$ WGN$(\mathbf{0}, \mathbf{R})$   $\mathbf{R} = \mathbf{I}$ 
Noise $v(n)$ parameters: $v(n) \sim$ WGN$(0, \sigma_v^{2})$   $\sigma_v = 0.1$ 

For these values, the degree of nonstationarity is $\eta(n) = 0.1414$, which means that the RLS 
can track the time variations of the SOE. 

Three different adaptations (slow, matched, and fast) of the RLS algorithm were designed. 
Their adaptation results are shown in Figures 10.49 through 10.54. From (10.8.75) and (10.8.77), 
the optimum misadjustment performance is obtained when 

$$\lambda_{\mathrm{opt}} = 0.9 \quad\text{with}\quad \mathcal{M}_{\min}(\infty) = 0.2$$

while from (10.8.78) and (10.8.79), the optimum deviation performance is obtained when 

$$\lambda_{\mathrm{opt}} = 0.93 \quad\text{with}\quad \mathcal{D}_{\min}(\infty) = 0.007$$

Hence, the following values for $\lambda$ were selected for simulation: 

Slow:    $\lambda = 0.99$ 
Matched: $\lambda = 0.9$ 
Fast:    $\lambda = 0.5$ 

Figure 10.49 shows the matched adaptation of parameter coefficients while Figure 10.50 shows 
the resulting D(n) and M(n). Clearly, the RLS tracks the varying coefficients nicely with 
expected small misregistration and deviation errors. Figure 10.51 shows the slow adaptation of 
parameter coefficients while Figure 10.52 shows the resulting D(n) and M(n). In this case, 
although the RLS algorithm tracks with bounded error variance, the tracking is not very good 
and the resulting misregistration errors are large. Finally, Figure 10.53 shows the fast adaptation 
of parameter coefficients while Figure 10.54 shows the resulting D(n) and M(n). In this case, 
although the algorithm is able to keep track of the slowly varying coefficients, the resulting 
variance is large and hence the estimation errors are large. Once again, the total errors are large 
compared to those for the matched case. 

FIGURE 10.49 
Matched adaptation of slowly time-varying parameters: RLS algorithm with $\lambda = 0.9$. 


10.8.5 Comparison of Tracking Performance 


When the optimum filter drifts like a random walk with small increment variance $\sigma_\psi^{2}$, the 
tracking performance for the LMS algorithm is given by (10.8.54) and (10.8.62) while that 
for the RLS algorithm is given by (10.8.74) and (10.8.77). Whether the LMS or the RLS 
algorithm is better depends on matrices $\mathbf{R}$ and $\mathbf{R}_\psi$. A general comparison is difficult to 
make, but some guidelines have been developed for particular cases. It has been shown that 
(Haykin 1996) 

FIGURE 10.50 
Learning curves of RLS algorithm for matched adaptation. 

FIGURE 10.51 
Slow adaptation of slowly time-varying parameters: RLS algorithm with $\lambda = 0.99$. 

FIGURE 10.52 
Learning curves of RLS algorithm for slow adaptation. 

FIGURE 10.53 
Fast adaptation of slowly time-varying parameters: RLS algorithm with $\lambda = 0.5$. 

FIGURE 10.54 
Learning curves of RLS algorithm for fast adaptation. 


• When $\mathbf{R}_\psi = \sigma_\psi^{2}\mathbf{I}$, then both the LMS and RLS algorithms produce essentially the same 
minimum levels of MSD and misadjustment. However, this analysis is true only asymptotically 
and for slowly varying parameters (small $\sigma_\psi^{2}$). 

• When $\mathbf{R}_\psi = \alpha\mathbf{R}$ where $\alpha$ is a constant, then the LMS algorithm produces smaller values 
of the minimum levels of MSD and misadjustment than the RLS algorithm does. 

• When $\mathbf{R}_\psi = \beta\mathbf{R}^{-1}$ where $\beta$ is a constant, then the RLS algorithm is better than the 
LMS algorithm in producing the smaller values of the minimum levels of MSD and 
misadjustment. 


In summary, we should state that in practice the comparison of the acquisition and tracking 
performance of LMS and RLS adaptive filters is a very complicated subject. Although 
the previous analysis provides some insight, only extensive simulations in the context of a 
specific application can help to choose the appropriate algorithm. 


10.9 SUMMARY 


In this chapter we discussed the theory of operation, design, performance evaluation, imple- 
mentation, and applications of adaptive filters. The most significant attribute of an adaptive 
filter is its ability to incrementally adjust its coefficients so as to improve a predefined 
criterion of performance over time. 

We basically developed and analyzed two families of adaptive filtering algorithms: 


• The family of LMS FIR adaptive filters, which are based on a stochastic version of the 
steepest-descent optimization algorithm. 

• The family of RLS FIR adaptive filters, which are based on a stochastic version of the 
Newton-type optimization algorithms. 

Both types of approaches can be used to develop adaptive algorithms for direct-form and 
lattice-ladder FIR filter structures. 

For LMS adaptive filters we focused on direct-form structures because those are the 
most widely used and studied. However, we briefly discussed transform-domain and sub- 
band implementations because they offer a viable solution for applications that require 
adaptive filters with very long impulse responses. 

All RLS FIR adaptive filters discussed in this chapter exhibit identical performance if 
they are implemented using infinite-precision arithmetic. However, they differ in terms of 
computational complexity and performance under finite-word-length implementations. The 
various types of RLS algorithms are summarized in Figure 10.40. We stress that algorithms 
for array processing can be used for FIR filtering (shift-invariant input data vector), but 
not vice versa. However, such a practice is not recommended because the computational 
complexity is much higher. The LMS algorithm (Section 10.4), the CRLS algorithm (Section 
10.5), and the QR decomposition—based algorithms (Section 10.6) are general and can be 
used for both array processing and FIR filtering applications. In contrast, the fast RLS 
algorithms in Section 10.7 can be used only for FIR filtering and prediction applications. 
The steady-state performance of LMS and RLS algorithms in a stationary environment is 
discussed in Sections 10.4 and 10.5, whereas their tracking performance in a nonstationary 
environment is analyzed in Section 10.8. 

The treatment of adaptive filters in this chapter has been quite extensive, in both number 
of topics and depth. However, the following important topics have been omitted: 


• IIR adaptive filters (Treichler et al. 1987; Johnson 1984; Shynk 1989; Regalia 1995; Netto 
et al. 1995; Williamson 1998). Although adaptive IIR filters have the potential to offer the 
same performance as FIR filters with less computational complexity, they are not widely 
used in practical applications. The main reasons are related to the nonquadratic nature 
of their performance error surface (see Section 6.2) and the additional stability problems 
caused by the presence of poles in their system function. 

• Adaptive filters using nonlinear filtering structures and neural networks (Grant and Mulgrew 
1995; Haykin 1996; Mathews 1991). The need for such filters arises in applications 
involving nonlinear input-output relationships, nonlinear detectors (e.g., data equaliza- 
tion), and non-Gaussian or impulsive noise. The optimization required in some of these 
cases can be performed using genetic optimization algorithms (Tang et al. 1996). 

• FIR direct-form and lattice-ladder LS adaptive filters for multichannel signals (Slock 
1993; Ling 1993b; Carayannis et al. 1986). 


PROBLEMS 
10.1. Consider the process x(n) generated using the AR(3) model 
x(n) = —0.729x(n — 3) + w(n) 


where w(n) ~ WGN(0, 1). We want to design a linear predictor of x(n) using the SDA 
algorithm. Let 


$$\hat{y}(n) = \hat{x}(n) = c_{o,1}\,x(n-1) + c_{o,2}\,x(n-2) + c_{o,3}\,x(n-3)$$

(a) Determine the 3 × 3 autocorrelation matrix $\mathbf{R}$ of $x(n)$, and compute its eigenvalues $\{\lambda_i\}_{i=1}^{3}$. 

(b) Determine the 3 × 1 cross-correlation vector $\mathbf{d}$. 

(c) Choose the step size $\mu$ so that the resulting response is overdamped. Now implement the 
SDA 

$$\mathbf{c}_k = [c_{k,1} \;\; c_{k,2} \;\; c_{k,3}]^{T} = \mathbf{c}_{k-1} + 2\mu\,(\mathbf{d} - \mathbf{R}\mathbf{c}_{k-1})$$

and plot the trajectories of $\{c_{k,i}\}_{i=1}^{3}$ as a function of $k$. 
(d) Repeat part (c) by choosing $\mu$ so that the response is underdamped. 


10.2 In the SDA algorithm, the index $k$ is an iteration index and not a time index. However, we 
can treat it as a time index and use the instantaneous filter coefficient vector $\mathbf{c}_k$ to filter data at 
$n = k$. This will result in an asymptotically optimum filter whose coefficients will converge to 
the optimum one. Consider the process $x(n)$ given in Problem 10.1. 

(a) Generate 500 samples of $x(n)$ and implement the asymptotically optimum filter. Plot the 
signal $\hat{y}(n)$. 

(b) Implement the optimum filter $\mathbf{c}_o$ on the same sequence, and plot the resulting $\hat{y}(n)$. 

(c) Comment on the above two plots. 


10.3 Consider the AR(2) process $x(n)$ given in Example 10.3.1. We want to implement the Newton-type 
algorithm for faster convergence using 

$$\mathbf{c}_k = \mathbf{c}_{k-1} - \mu\,\mathbf{R}^{-1}\,\nabla P(\mathbf{c}_{k-1})$$

(a) Using $a_1 = -1.5955$ and $a_2 = 0.95$, implement the above method for $\mu = 0.1$ and 
$\mathbf{c}_0 = \mathbf{0}$. Plot the locus of $c_{k,1}$ versus $c_{k,2}$. 

(b) Repeat part (a), using $a_1 = -0.195$ and $a_2 = 0.95$. 

(c) Repeat parts (a) and (b), using the optimum step size for $\mu$ that results in the fastest 
convergence. 


10.4 Consider the adaptive linear prediction of an AR(2) process $x(n)$ using the LMS algorithm in 
which 

$$x(n) = 0.95\,x(n-1) - 0.9\,x(n-2) + w(n)$$

where $w(n) \sim$ WGN$(0, \sigma_w^{2})$. The adaptive predictor is a second-order one given by $\mathbf{a}(n) = 
[a_1(n) \;\; a_2(n)]^{T}$. 

(a) Implement the LMS algorithm given in Table 10.3 as a MATLAB function 
[c,e] = lplms(x,y,mu,M,c0). 

which computes filter coefficients in c and the corresponding error in e, given signal x, 
desired signal y, step size mu, filter order M, and the initial coefficient vector c0. 

(b) Generate 500 samples of $x(n)$, and obtain linear predictor coefficients using the above 
function. Use step size $\mu$ so that the algorithm converges in the mean. Plot predictor 
coefficients as a function of time along with the true coefficients. 

(c) Repeat the above simulation 1000 times to obtain the learning curve, which is obtained 
by averaging the squared error $|e(n)|^{2}$. Plot this curve and compare its steady-state value 
with the theoretical MSE. 


10.5 Consider the adaptive echo canceler given in Figure 10.23. The FIR filter $c_o(n)$ is given by 

$$c_o(n) = (0.9)^{n} \qquad 0 \leq n \leq 2$$

In this simulation, ignore the far-end signal $u(n)$. The data signal $x(n)$ is a zero-mean, unit-variance 
white Gaussian process, and $y(n)$ is its echo. 

(a) Generate 1000 samples of $x(n)$ and determine $y(n)$. Use these signals to obtain a fourth-order 
LMS echo canceler in which the step size $\mu$ is chosen to satisfy (10.4.40) and 
$\mathbf{c}(0) = \mathbf{0}$. Obtain the final echo canceler coefficients and compare them with the true ones. 

(b) Repeat the above simulation 500 times, and obtain the learning curve. Plot this curve along 
with the actual MSE and comment on the plot. 

(c) Repeat parts (a) and (b), using a third-order echo canceler. 

(d) Repeat parts (a) and (b), using one-half the value of $\mu$ used in the first part. 


10.6 The normalized LMS (NLMS) algorithm is given in (10.4.67), in which the effective step size 
is time-varying and is given by $\tilde{\mu}/\|\mathbf{x}(n)\|^{2}$, where $0 < \tilde{\mu} < 1$. 

(a) Modify the function firlms to implement the NLMS algorithm and obtain the function 


{c,e] = nfirlms(x,y,mu,M,c0). 
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10.7 


10.8 


10.9 


10.10 


10.11 


10.12 


(b) Choose jt = 0.1 and repeat Problem 10.4. Compare your results in terms of convergence 
speed. 

(c) Choose jz = 0.1 and repeat Problem 10.5(a) and (b). Compare your results in terms of 
convergence speed. 


Another variation of the LMS algorithm is called the sign-error LMS algorithm, in which the 
coefficient update equation is given by 
e(n) = e(n — 1) + 2 sgnle(n)]x(2) 
1 Refe(n)] > 0 
where sgn [e(n)] = 0 Re[e(n)] =0 
—1 Refe(n)] <0 
The advantage of this algorithm is that the multiplication is replaced by a sign change, and if 
is chosen as a negative power of 2, then the multiplication is replaced by a shifting operation 


that is easy and fast to implement. Furthermore, since sgn(x) = x/|x|, the effective step size 
jz is inversely proportional to the magnitude of the error. 


(a) Modify the function firlms to implement the sign-error LMS algorithm and obtain the 
function 


[c,e] = sefirlms(x,y,mu,M,c0). 


(b) Repeat Problem 10.4 and compare your results in terms of convergence speed. 
(c) Repeat Problem 10.5(a) and (b) and compare your results in terms of convergence speed. 


10.8 Consider an AR(1) process x(n) = ax(n − 1) + w(n), where w(n) ~ WGN(0, $\sigma_w^2$). We wish
to design a one-step first-order linear predictor using the LMS algorithm
$$\hat{x}(n) = \hat{a}(n-1)\,x(n-1)$$
$$e(n) = x(n) - \hat{x}(n)$$
$$\hat{a}(n) = \hat{a}(n-1) + 2\mu\,e(n)\,x(n-1)$$
where μ is the adaptation step size.


(a) Determine the autocorrelation $r_x(l)$, the optimum first-order linear predictor, and the
corresponding MMSE.

(b) Using the independence assumption, first determine and then solve the difference equation
for $E\{\hat{a}(n)\}$.

(c) For a = 0.95, μ = 0.025, $\sigma_x^2 = 1$, and 0 ≤ n ≤ N = 500, determine the ensemble
average of $E\{\hat{a}(n)\}$ using 200 independent runs and compare with the theoretical curve
obtained in part (b).

(d) Using the independence assumption, first determine and then solve the difference equation
for $P(n) = E\{e^2(n)\}$.

(e) Repeat part (c) for P(n) and comment upon the results.
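
The comparison between a simulated coefficient track and the mean predicted under the independence assumption can be sketched as follows for the real-valued case. Taking expectations of the update with $\hat{a}(n-1)$ assumed independent of x(n − 1) gives $E\{\hat{a}(n)\} = [1 - 2\mu r_x(0)]E\{\hat{a}(n-1)\} + 2\mu r_x(1)$, which is what `mean_recursion` iterates. The function names and the values a = 0.5, μ = 0.005, $\sigma_w^2 = 1$ are assumptions chosen for a quick check, not the values of part (c).

```python
import random

def mean_recursion(a, mu, sigma_w2, N, a0=0.0):
    """E{a_hat(n)} under the independence assumption (real-valued case)."""
    r0 = sigma_w2 / (1.0 - a * a)   # r_x(0) of the AR(1) process
    r1 = a * r0                     # r_x(1)
    m, curve = a0, []
    for _ in range(N):
        m = (1.0 - 2.0 * mu * r0) * m + 2.0 * mu * r1
        curve.append(m)
    return curve

def lms_predictor(x, mu, a0=0.0):
    """Run the first-order LMS predictor and return the track of a_hat(n)."""
    a_hat, track = a0, []
    for n in range(1, len(x)):
        e = x[n] - a_hat * x[n - 1]          # prediction error
        a_hat += 2.0 * mu * e * x[n - 1]     # LMS coefficient update
        track.append(a_hat)
    return track

# Hypothetical parameters for a quick check (assumptions)
rng = random.Random(1)
a, mu, N = 0.5, 0.005, 20000
x = [0.0]
for _ in range(N):
    x.append(a * x[-1] + rng.gauss(0.0, 1.0))
track = lms_predictor(x, mu)
theory = mean_recursion(a, mu, 1.0, N)
```

For the problem itself, the single run would be replaced by an average of `track` over independent runs. With larger step sizes the simulated mean may deviate visibly from the theoretical curve, because the independence assumption is only an approximation.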


10.9 Using the a posteriori error $\varepsilon(n) = y(n) - \mathbf{c}^H(n)\mathbf{x}(n)$, derive the coefficient updating formulas
for the a posteriori error LMS algorithm. Note: Refer to Equations (10.2.20) to (10.2.22).


10.10 Solve the interference cancelation problem described in Example 6.4.1, using the LMS
algorithm, and compare its performance to that of the optimum canceler.


10.11 Repeat the convergence analysis of the LMS algorithm for the complex case, using formula
(10.4.27) instead of (10.4.28).


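10.12 Consider the total transient excess MSE, defined by
$$P_{\mathrm{tr}}^{(\mathrm{total})} = \sum_{n=0}^{\infty} P_{\mathrm{tr}}(n)$$
in Section 10.4.3.

(a) Show that $P_{\mathrm{tr}}^{(\mathrm{total})}$ can be written as
$P_{\mathrm{tr}}^{(\mathrm{total})} = \boldsymbol{\lambda}^T(\mathbf{I} - \mathbf{B})^{-1}\Delta\boldsymbol{\theta}(0)$, where $\Delta\boldsymbol{\theta}(0)$ is the
initial (i.e., at n = 0) deviation of the filter coefficients from their optimum setting.

(b) Starting with the formula in step (a), show that
$$P_{\mathrm{tr}}^{(\mathrm{total})} = \frac{1}{4\mu}\,\frac{\displaystyle\sum_{i=1}^{M}\frac{\Delta\theta_i(0)}{1 - 2\mu\lambda_i}}{\displaystyle 1 - \sum_{i=1}^{M}\frac{\mu\lambda_i}{1 - 2\mu\lambda_i}}$$

(c) Show that if $\mu\lambda_k \ll 1$, then
$$P_{\mathrm{tr}}^{(\mathrm{total})} \simeq \frac{1}{4\mu}\,\frac{\sum_{i=1}^{M}\Delta\theta_i(0)}{1 - \mu\,\mathrm{tr}(\mathbf{R})} \simeq \frac{1}{4\mu}\sum_{i=1}^{M}\Delta\theta_i(0)$$
which is formula (10.4.62), discussed in Section 10.4.3.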


10.13 The frequency sampling structure for the implementation of an FIR filter
$H(z) = \sum_{n=0}^{M-1} h(n)z^{-n}$ is specified by the following relation
$$H(z) = \frac{1 - z^{-M}}{M}\sum_{k=0}^{M-1}\frac{H(e^{j2\pi k/M})}{1 - e^{j2\pi k/M}z^{-1}} \triangleq H_1(z)\,H_2(z)$$
where $H_1(z)$ is a comb filter with M zeros equally spaced on the unit circle and $H_2(z)$ is a
filter bank of resonators. Note that $\tilde{H}(k) \triangleq H(e^{j2\pi k/M})$, the DFT of $\{h(n)\}_0^{M-1}$, provides the
coefficients of the filter. Derive an LMS-type algorithm to update these coefficients, and sketch
the resulting adaptive filter structure.


10.14 There are applications in which the use of a non-MSE criterion may be more appropriate. To
this end, suppose that we wish to design and study the behavior of an "LMS-like" algorithm
that minimizes the cost function $P^{(k)} = E\{e^{2k}(n)\}$, k = 1, 2, 3, ..., using the model defined
in Figure 10.19.

(a) Use the instantaneous gradient vector to derive the coefficient updating formula for this
LMS-like algorithm.

(b) Using the assumptions introduced in Section 10.4.2 show that
$$E\{\tilde{\mathbf{c}}(n)\} = \left[\mathbf{I} - 2\mu k(2k-1)E\{e_o^{2k-2}(n)\}\mathbf{R}\right]E\{\tilde{\mathbf{c}}(n-1)\}$$
where R is the input correlation matrix.

(c) Show that the derived algorithm converges in the mean if
$$0 < 2\mu < \frac{1}{k(2k-1)E\{e_o^{2k-2}(n)\}\lambda_{\max}}$$
where $\lambda_{\max}$ is the largest eigenvalue of R.

(d) Show that for k = 1 the results in parts (a) to (c) reduce to those for the standard LMS
algorithm.


10.15 Consider the noise cancelation system shown in Figure 10.6. The useful signal is a sinusoid
s(n) = cos(ω₀n + φ), where ω₀ = π/16 and the phase φ is a random variable uniformly
distributed from 0 to 2π. The noise signals are given by v₁(n) = 0.9 v₁(n − 1) + w(n) and
v₂(n) = −0.75 v₂(n − 1) + w(n), where the sequences w(n) are WGN(0, 1).


(a) Design an optimum filter of length M and choose a reasonable value for $M_o$ by plotting
the MMSE as a function of M.

(b) Design an LMS filter with $M_o$ coefficients and choose the step size μ to achieve a 10
percent misadjustment.

(c) Plot the signals s(n), s(n) + v₁(n), v₂(n), the clean signal $e_o(n)$ using the optimum filter,
and the clean signal $e_{\mathrm{lms}}(n)$ using the LMS filter, and comment upon the obtained results.


10.16 A modification of the LMS algorithm, known as the momentum LMS (MLMS), is defined by
$$\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu\,e^*(n)\,\mathbf{x}(n) + \alpha[\mathbf{c}(n-1) - \mathbf{c}(n-2)]$$
where |α| < 1 (Roy and Shynk 1990).


(a) Rewrite the previous equation to show that the algorithm has the structure of a low-pass
(0 < α < 1) or a high-pass (−1 < α < 0) filter.

(b) Explain intuitively the effect of the momentum term α[c(n − 1) − c(n − 2)] on the filter's
convergence behavior.

(c) Repeat the computer equalization experiment in Section 10.4.4, using both the LMS and
the MLMS algorithms for the following cases, and compare their performance:

i. W = 3.1, $\mu_{\mathrm{lms}} = \mu_{\mathrm{mlms}} = 0.01$, α = 0.5.
ii. W = 3.1, $\mu_{\mathrm{lms}} = 0.04$, $\mu_{\mathrm{mlms}} = 0.01$, α = 0.5.
iii. W = 3.1, $\mu_{\mathrm{lms}} = \mu_{\mathrm{mlms}} = 0.04$, α = 0.2.
iv. W = 4, $\mu_{\mathrm{lms}} = \mu_{\mathrm{mlms}} = 0.03$, α = 0.3.


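10.17 In Section 10.4.5 we presented the leaky LMS algorithm [see (10.4.88)]
$$\mathbf{c}(n) = (1 - \alpha\mu)\,\mathbf{c}(n-1) + \mu\,e^*(n)\,\mathbf{x}(n)$$
where 0 < α < 1 is the leakage coefficient.


(a) Show that the coefficient updating equation can be obtained by minimizing
$$P(n) = |e(n)|^2 + \alpha\|\mathbf{c}(n)\|^2$$
(b) Using the independence assumptions, show that
$$E\{\mathbf{c}(n)\} = [\mathbf{I} - \mu(\mathbf{R} + \alpha\mathbf{I})]\,E\{\mathbf{c}(n-1)\} + \mu\mathbf{d}$$
where $\mathbf{R} = E\{\mathbf{x}(n)\mathbf{x}^H(n)\}$ and $\mathbf{d} = E\{\mathbf{x}(n)y^*(n)\}$.
(c) Show that if $0 < \mu < 2/(\alpha + \lambda_{\max})$, where $\lambda_{\max}$ is the maximum eigenvalue of R, then
$$\lim_{n\to\infty} E\{\mathbf{c}(n)\} = (\mathbf{R} + \alpha\mathbf{I})^{-1}\mathbf{d}$$
that is, in the steady state $E\{\mathbf{c}(\infty)\} \neq \mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d}$.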
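
The bias stated in part (c) can be checked numerically with a small sketch. It iterates the mean recursion of part (b) for a real 2 × 2 example and compares the limit with $(\mathbf{R} + \alpha\mathbf{I})^{-1}\mathbf{d}$ and with $\mathbf{R}^{-1}\mathbf{d}$. The matrix R, the vector d, and the values of α and μ are illustrative assumptions, as are the helper names.

```python
def leaky_mean_limit(R, d, alpha, mu, iters=2000):
    """Iterate E{c(n)} = [I - mu (R + alpha I)] E{c(n-1)} + mu d for a 2x2 real R."""
    c = [0.0, 0.0]
    for _ in range(iters):
        Rc = [R[0][0] * c[0] + R[0][1] * c[1],
              R[1][0] * c[0] + R[1][1] * c[1]]
        c = [c[i] - mu * (Rc[i] + alpha * c[i]) + mu * d[i] for i in range(2)]
    return c

def solve2(A, b):
    """Solve a 2x2 linear system A z = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

# Hypothetical statistics (assumptions): eigenvalues of R are 0.5 and 1.5
R = [[1.0, 0.5], [0.5, 1.0]]
d = [0.7, 0.3]
alpha, mu = 0.1, 0.1          # satisfies 0 < mu < 2 / (alpha + lambda_max)

c_inf = leaky_mean_limit(R, d, alpha, mu)
c_leaky = solve2([[R[0][0] + alpha, R[0][1]],
                  [R[1][0], R[1][1] + alpha]], d)
c_opt = solve2(R, d)          # unbiased optimum R^{-1} d, for comparison
```

In this example the iterated mean agrees with `c_leaky` and differs from `c_opt`, which illustrates that leakage trades a steady-state bias for better conditioning.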


10.18 There are various communications and speech signal processing applications that require the
use of filters with linear phase (Manolakis et al. 1984). For simplicity, assume that m is even.


(a) Derive the normal equations for an optimum FIR filter that satisfies the constraints

i. $\mathbf{c}_m^{(\mathrm{lp})} = \mathbf{J}\mathbf{c}_m^{(\mathrm{lp})*}$ (linear phase)
ii. $\mathbf{c}_m^{(\mathrm{cgd})} = -\mathbf{J}\mathbf{c}_m^{(\mathrm{cgd})*}$ (constant group delay).

(b) Show that the obtained optimum filters can be expressed as
$\mathbf{c}_m^{(\mathrm{lp})} = \tfrac{1}{2}(\mathbf{c}_m + \mathbf{J}\mathbf{c}_m^*)$ and
$\mathbf{c}_m^{(\mathrm{cgd})} = \tfrac{1}{2}(\mathbf{c}_m - \mathbf{J}\mathbf{c}_m^*)$, where $\mathbf{c}_m$ is the unconstrained optimum filter.

(c) Using the results in part (b) and the algorithm of Levinson, derive a lattice-ladder structure
for the constrained optimum filters.

(d) Repeat parts (a), (b), and (c) for the linear predictor with linear phase, which is specified
by $\mathbf{a}_m^{(\mathrm{lp})} = \mathbf{J}\mathbf{a}_m^{(\mathrm{lp})*}$.

(e) Develop an LMS algorithm for the linear-phase filter $\mathbf{c}_m^{(\mathrm{lp})} = \mathbf{J}\mathbf{c}_m^{(\mathrm{lp})*}$ and sketch the resulting
structure. Can you draw any conclusions regarding the step size and the misadjustment of
this filter compared to those of the unconstrained LMS algorithm?


10.19 In this problem, we develop and analyze by simulation an LMS-type adaptive lattice predictor
introduced in Griffiths (1977). We consider the all-zero lattice filter defined in (7.5.7), which
is completely specified by the lattice parameters $\{k_m\}_0^{M-1}$. The input signal is assumed wide-
sense stationary.


(a) Consider the cost function
$$P_m^{fb} = E\{|e_m^f(n)|^2 + |e_m^b(n)|^2\}$$
which provides the total prediction error power at the output of the mth stage, and show
that
$$\frac{\partial P_m^{fb}}{\partial k_{m-1}^*} = 2E\{e_m^{f*}(n)\,e_{m-1}^b(n-1) + e_{m-1}^{f*}(n)\,e_m^b(n)\}$$
(b) Derive the updating formula using the LMS-type approach
$$k_{m-1}(n) = k_{m-1}(n-1) - 2\mu(n)\left[e_m^{f*}(n)\,e_{m-1}^b(n-1) + e_{m-1}^{f*}(n)\,e_m^b(n)\right]$$
where the normalized step size $\mu(n) = \tilde{\mu}/E_{m-1}(n)$ is computed in practice by using the
formula
$$E_{m-1}(n) = \alpha E_{m-1}(n-1) + (1-\alpha)\left[|e_{m-1}^f(n)|^2 + |e_{m-1}^b(n-1)|^2\right]$$
where 0 < α < 1. Explain the role and proper choice of α, and determine the proper
initialization of the algorithm.

(c) Write a MATLAB function to implement the derived algorithm, and compare its performance 
with that of the LMS algorithm in the linear prediction problem discussed in Example 
10.4.1. 


10.20 Consider a signal x(n) consisting of a harmonic process plus white noise, that is,
$$x(n) = A\cos(\omega_1 n + \phi) + w(n)$$
where φ is uniformly distributed from 0 to 2π and w(n) ~ WGN(0, $\sigma_w^2$).


(a) Determine the output power $\sigma_y^2 = E\{y^2(n)\}$ of the causal and stable filter
$$y(n) = \sum_{k=0}^{\infty} h(k)\,x(n-k)$$
and show that we can cancel the harmonic process using the ideal notch filter
$$H(e^{j\omega}) = \begin{cases} 1 & \omega \neq \omega_1 \\ 0 & \text{otherwise} \end{cases}$$
Is the obtained ideal notch filter practically realizable? That is, is the system function
rational? Why?
(b) Consider the second-order notch filter
$$H(z) = \frac{D(z)}{A(z)} = \frac{1 + az^{-1} + z^{-2}}{1 + a\rho z^{-1} + \rho^2 z^{-2}} = \frac{D(z)}{D(z/\rho)}$$
where −1 < ρ < 1 determines the steepness of the notch and a = −2 cos ω₀ its frequency.
We fix ρ, and we wish to design an adaptive filter by adjusting a.
i. Show that for ρ ≈ 1, $\sigma_y^2 = A^2|H(e^{j\omega_1})|^2 + \sigma_w^2$, and plot $\sigma_y^2$ as a function of the
frequency ω₀ for ω₁ = π/6.
ii. Evaluate $d\sigma_y^2(a)/da$ and show that the minimum of $\sigma_y^2(a)$ occurs for a = −2 cos ω₁.
(c) Using a direct-form II structure for the implementation of H(z) and the property
dY(z)/da = [dH(z)/da]X(z), show that the following relations
$$s_2(n) = -a(n-1)\rho\,s_2(n-1) - \rho^2 s_2(n-2) + (1-\rho)\,s_1(n-1)$$
$$g(n) = s_2(n) - \rho\,s_2(n-2)$$
$$s_1(n) = -a(n-1)\rho\,s_1(n-1) - \rho^2 s_1(n-2) + x(n)$$
$$y(n) = s_1(n) + a(n-1)\,s_1(n-1) + s_1(n-2)$$
$$a(n) = a(n-1) - 2\mu\,y(n)\,g(n)$$
constitute an adaptive LMS notch filter. Draw its block diagram realization.

(d) Simulate the operation of the obtained adaptive filter for ρ = 0.9, ω₁ = π/6, and SNR 5
and 15 dB. Plot ω₀(n) = arccos[−a(n)/2] as a function of n, and investigate the tradeoff
between convergence rate and misadjustment by experimenting with various values of μ.
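
To get a feel for the second-order notch filter of part (b) before making it adaptive, the following Python sketch evaluates $H(e^{j\omega}) = D(e^{j\omega})/D(e^{j\omega}/\rho)$ directly. The function name `notch_response` and the evaluation frequencies are assumptions for illustration.

```python
import cmath
import math

def notch_response(omega, omega0, rho):
    """Frequency response of H(z) = (1 + a z^-1 + z^-2) / (1 + a rho z^-1 + rho^2 z^-2)."""
    a = -2.0 * math.cos(omega0)          # notch frequency parameter
    zinv = cmath.exp(-1j * omega)        # z^-1 on the unit circle
    num = 1.0 + a * zinv + zinv ** 2
    den = 1.0 + a * rho * zinv + (rho * zinv) ** 2
    return num / den

omega0, rho = math.pi / 6, 0.9
at_notch = abs(notch_response(omega0, omega0, rho))        # gain at the notch frequency
away = abs(notch_response(omega0 + 1.0, omega0, rho))      # gain 1 rad/sample away
```

The gain vanishes at ω₀ and stays close to 1 away from it. Moving ρ toward 1 narrows the notch, which is why the approximation in part (b)i is stated for ρ ≈ 1.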


10.21 Consider the AR(2) process given in Problem 10.4. We will design the adaptive linear predictor
using the RLS algorithm. The adaptive predictor is a second-order one given by
c(n) = [c₁(n) c₂(n)]ᵀ.


(a) Develop a MATLAB function to implement the RLS algorithm given in Table 10.6

[c,e] = rls(x,y,lambda,delta,M,c0);

which computes filter coefficients in c and the corresponding error in e given signal x,
desired signal y, forgetting factor lambda, initialization parameter delta, filter order M,
and the initial coefficient vector c0. To update P(n), compute only the upper or lower
triangular part and determine the other part by using Hermitian symmetry.

(b) Generate 500 samples of x(n) and obtain linear predictor coefficients using the above
function. Use a very small value for δ (for example, 0.001) and various values of λ = 0.99,
0.95, 0.9, and 0.8. Plot predictor coefficients as a function of time along with the true
coefficients for each λ, and discuss your observations. Also compare your results with
those in Problem 10.4.

(c) Repeat each simulation above 1000 times to get corresponding learning curves, which are
obtained by averaging respective squared errors |e(n)|². Plot these curves and compare
their steady-state value with the theoretical MSE.
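
A possible shape for the RLS recursion in this problem is sketched below in Python for real-valued data. It is a minimal illustration that assumes the conventional RLS form (gain vector, a priori error, coefficient update, update of P), not a transcription of Table 10.6. For brevity it updates the full matrix P rather than only one triangle, and the 2-tap identification test is a hypothetical example.

```python
import random

def rls(x, y, lam, delta, M, c0=None):
    """Real-valued conventional RLS sketch with P(-1) = (1/delta) I."""
    c = list(c0) if c0 is not None else [0.0] * M
    P = [[(1.0 / delta if i == j else 0.0) for j in range(M)] for i in range(M)]
    e = []
    for n in range(len(x)):
        xv = [x[n - k] if n - k >= 0 else 0.0 for k in range(M)]
        Px = [sum(P[i][j] * xv[j] for j in range(M)) for i in range(M)]
        denom = lam + sum(xv[i] * Px[i] for i in range(M))
        g = [v / denom for v in Px]                            # adaptation gain
        err = y[n] - sum(ci * xi for ci, xi in zip(c, xv))     # a priori error
        c = [ci + gi * err for ci, gi in zip(c, g)]
        # P <- (P - g x^T P) / lam; x^T P equals Px^T because P is symmetric
        P = [[(P[i][j] - g[i] * Px[j]) / lam for j in range(M)] for i in range(M)]
        e.append(err)
    return c, e

# Hypothetical noise-free identification test (an assumption, not from the text)
rng = random.Random(0)
h = [0.5, -0.3]
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [sum(h[k] * x[n - k] for k in range(2) if n - k >= 0) for n in range(200)]
c, e = rls(x, y, lam=0.99, delta=0.001, M=2)
```

For the linear prediction setup of the problem, the desired signal would be x(n) and the filter input the delayed sequence x(n − 1).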


10.22 Consider a system identification problem where we observe the input x(n) and the noisy output
$y(n) = y_o(n) + v(n)$, for 0 ≤ n ≤ N − 1. The unknown system is specified by the system
function
$$H_o(z) = \frac{0.0675 + 0.1349z^{-1} + 0.0675z^{-2}}{1 - 1.1430z^{-1} + 0.4128z^{-2}}$$
and x(n) ~ WGN(0, 1), v(n) ~ WGN(0, 0.01), and N = 300.

(a) Model the unknown system using an LS FIR filter, with M = 15 coefficients, using the
no-windowing method. Compute the total LSE $E_{\mathrm{ls}}$ in the interval n₀ ≤ n ≤ N − 1 for
n₀ = 20.

(b) Repeat part (a) for 0 ≤ n ≤ n₀ − 1 (do not compute $E_{\mathrm{ls}}$). Use the vector c(n₀) and
the matrix $\mathbf{P}(n_0) = \hat{\mathbf{R}}^{-1}(n_0)$ to initialize the CRLS algorithm. Compute the total errors
$E_{\mathrm{apr}} = \sum_{n=n_0}^{N-1} e^2(n)$ and $E_{\mathrm{apost}} = \sum_{n=n_0}^{N-1} \varepsilon^2(n)$ by running the CRLS for n₀ ≤ n ≤ N − 1.

(c) Order the quantities $E_{\mathrm{ls}}$, $E_{\mathrm{apr}}$, $E_{\mathrm{apost}}$ by size and justify the resulting ordering.


10.23 Prove Equation (10.5.25) using the identity det(I₁ + AB) = det(I₂ + BA), where the identity
matrices I₁ and I₂ and the matrices A and B have compatible dimensions. Hint: Put (10.5.7) in
the form I₁ + AB.


10.24 Derive the normal equations that correspond to the minimization of the cost function (10.5.36),
and show that for δ = 0 they are reduced to the standard set (10.5.2) of normal equations. For
the situation described in Problem 10.22, run the CRLS algorithm for various values of δ and
determine the range of values that provides acceptable performance.


10.25 Modify the CRLS algorithm in Table 10.6 so that its coefficients satisfy the linear-phase
constraint c = Jc*. For simplicity, assume that M = 2L; that is, the filter has an even number
of coefficients.


10.26 Following the approach used in Section 7.1.5 to develop the structure shown in Figure 7.1,
derive a similar structure based on the Cholesky (not the $\mathbf{LDL}^H$) decomposition.


10.27 Show that the partitioning (10.7.3) of $\hat{\mathbf{R}}_{m+1}(n)$ to obtain the same partitioning structure as
(10.7.2) is possible only if we apply the prewindowing condition x(−1) = 0. What is the
form of the partitioning if we abandon the prewindowing assumption?


10.28 Derive the normal equations and the LSE formulas given in Table 10.11 for the FLP and the
BLP methods.




10.29 Derive the FLP and BLP a priori and a posteriori updating formulas given in Table 10.12.


10.30 Modify Table 10.14 for the FAEST algorithm, to obtain a table for the FTF algorithm, and write
a MATLAB function for its implementation. Test the obtained function, using the equalization
experiment in Example 10.5.2.


10.31 If we wish to initialize the fast RLS algorithms (fast Kalman, FAEST, and FTF) using an exact
method, we need to collect a set of data $\{x(n), y(n)\}_0^{n_0}$ for any n₀ > M.


(a) Identify the quantities needed to start the FAEST algorithm at n = n₀. Form the normal
equations and use the $\mathbf{LDL}^H$ decomposition method to determine these quantities.

(b) Write a MATLAB function faestexact.m that implements the FAEST algorithm using
the exact initialization procedure described in part (a).

(c) Use the functions faest.m and faestexact.m to compare the two different initialization
approaches for the FAEST algorithm in the context of the equalization experiment in
Example 10.5.2. Use n₀ = 1.5M and n₀ = 3M. Which value of δ gives results closest to
the exact initialization method?


10.32 Using the order-recursive approach introduced in Section 7.3.1, develop an order-recursive
algorithm for the solution of the normal equations (10.5.2). Note: In Section 7.3.1 we could
not develop a closed-form algorithm because some recursions required the quantities $\mathbf{b}_m(n-1)$
and $E_m^b(n-1)$. Here we can avoid this problem by using time recursions.


10.33 In this problem we discuss several quantities that can serve to warn of ill behavior in fast RLS
algorithms for FIR filters.


(a) Show that the variable
$$\eta_m(n) \triangleq \frac{\lambda E_m^b(n-1)}{E_m^b(n)} = 1 - g_{m+1}^{(m+1)}(n)\,e_m^{b*}(n)$$
satisfies the condition 0 ≤ $\eta_m(n)$ ≤ 1.
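(b) Prove the relations
$$\alpha_m(n) = \lambda^m\,\frac{\det\hat{\mathbf{R}}_m(n-1)}{\det\hat{\mathbf{R}}_m(n)} \qquad E_m^f(n) = \frac{\det\hat{\mathbf{R}}_{m+1}(n)}{\det\hat{\mathbf{R}}_m(n-1)} \qquad E_m^b(n) = \frac{\det\hat{\mathbf{R}}_{m+1}(n)}{\det\hat{\mathbf{R}}_m(n)}$$

(c) Show that
$$\alpha_m(n) = \lambda^m\,\frac{E_m^b(n)}{E_m^f(n)}$$
and use it to explain why the quantity $\eta_m^{\alpha}(n) = E_m^f(n) - \lambda^m E_m^b(n)$ can be used as a
warning variable.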
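(d) Explain how the quantities
$$\eta_{\bar{g}}(n) \triangleq \bar{g}_{M+1}^{(M+1)}(n) - \frac{e^b(n)}{\lambda E^b(n-1)} \quad\text{and}\quad \eta_b(n) \triangleq e^b(n) - \lambda E^b(n-1)\,\bar{g}_{M+1}^{(M+1)}(n)$$
can be used as warning variables.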


10.34 When the desired response is y(j) = δ(j − k), that is, a spike at j = k, 0 ≤ k ≤ n, the LS
filter $\mathbf{c}_m^{(k)}$ is known as a spiking filter or as an LS inverse filter (see Section 8.3).


(a) Determine the normal equations and the LSE $E_m^{(k)}(n)$ for the LS filter $\mathbf{c}_m^{(k)}$.

(b) Show that $\mathbf{c}_m^{(n)} = \mathbf{g}_m(n)$ and $E_m^{(n)}(n) = \alpha_m(n)$ and explain their meanings.

(c) Use the interpretation $\alpha_m(n) = E_m^{(n)}(n)$ to show that 0 ≤ $\alpha_m(n)$ ≤ 1.

(d) Show that $\mathbf{a}_m(n) = \sum_{k=0}^{n} \mathbf{c}_m^{(k)}(n-1)\,x(k)$ and explain its meaning.


10.35 Derive Equations (10.7.33) through (10.7.35) for the a posteriori LS lattice-ladder structure,
shown in Figure 10.38, starting with the partitionings (10.7.1) and the matrix inversion by
partitioning relations (10.7.7) and (10.7.8).


10.36 Prove relations (10.7.45) and (10.7.46) for the updating of the ladder partial correlation
coefficient $\beta_m^c(n)$.


10.37 In Section 7.3.1 we derived order-recursive relations for the FLP, BLP, and FIR filtering
MMSEs.


(a) Following the derivation of (7.3.36) and (7.3.37), derive similar order-recursive relations
for $E_m^f(n)$ and $E_m^b(n)$.

(b) Show that we can obtain a complete LS lattice-ladder algorithm by replacing, in Table
10.15, the time-recursive updatings of $E_m^f(n)$ and $E_m^b(n)$ with the obtained order-recursive
relations.

(c) Write a MATLAB function for this algorithm, and verify it by using the equalization
experiment in Example 10.5.2.


10.38 Derive the equations for the a priori RLS lattice-ladder algorithm given in Table 10.16, and
write a MATLAB function for its implementation. Test the function by using the equalization
experiment in Example 10.5.2.


10.39 Derive the equations for the a priori RLS lattice-ladder algorithm with error feedback (see
Table 10.7), and write a MATLAB function for its implementation. Test the function by using
the equalization experiment in Example 10.5.2.


10.40 Derive the equations for the a posteriori RLS lattice-ladder algorithm with error feedback (Ling
et al. 1986) and write a MATLAB function for its implementation. Test the function by using
the equalization experiment in Example 10.5.2.


10.41 The a posteriori and the a priori RLS lattice-ladder algorithms need the conversion factor
$\alpha_m(n)$ because the updating of the quantities $E_m^f(n)$, $E_m^b(n)$, $\beta_m(n)$, and $\beta_m^c(n)$ requires both
the a priori and a posteriori errors. Derive a double (a priori and a posteriori) lattice-ladder
RLS filter that avoids the use of the conversion factor by updating both the a priori and the a
posteriori prediction and filtering errors.


10.42 Program the RLS Givens lattice-ladder filter with square roots (see Table 10.18), and study its
use in the adaptive equalization experiment of Example 10.5.2.


10.43 Derive the formulas and program the RLS Givens lattice-ladder filter without square roots (see
Table 10.18), and study its use in the adaptive equalization experiment of Example 10.5.2.


10.44 In this problem we discuss the derivation of the normalized lattice-ladder RLS algorithm,
which uses a smaller number of time and order updating recursions and has better numerical
behavior due to the normalization of its variables. Note: You may find useful the discussion in
Carayannis et al. (1986).


(a) Define the energy and angle normalized variables
$$\bar{e}_m^f(n) = \frac{\varepsilon_m^f(n)}{\sqrt{\alpha_m(n-1)}\sqrt{E_m^f(n)}} \qquad \bar{e}_m^b(n) = \frac{\varepsilon_m^b(n)}{\sqrt{\alpha_m(n)}\sqrt{E_m^b(n)}} \qquad \bar{e}_m(n) = \frac{\varepsilon_m(n)}{\sqrt{\alpha_m(n)}\sqrt{E_m(n)}}$$
$$\bar{k}_m(n) = \frac{\beta_m(n)}{\sqrt{E_m^f(n)}\sqrt{E_m^b(n-1)}} \qquad \bar{k}_m^c(n) = \frac{\beta_m^c(n)}{\sqrt{E_m(n)}\sqrt{E_m^b(n)}}$$
and show that the normalized errors and the partial correlation coefficients $\bar{k}_m(n)$ and
$\bar{k}_m^c(n)$ have magnitude less than 1.


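(b) Derive the following normalized lattice-ladder RLS algorithm:

$E_0^f(-1) = E_0(-1) = \delta > 0$

For n = 0, 1, 2, ...
$$E_0^f(n) = \lambda E_0^f(n-1) + |x(n)|^2, \qquad E_0(n) = \lambda E_0(n-1) + |y(n)|^2$$
$$\bar{e}_0^f(n) = \bar{e}_0^b(n) = \frac{x(n)}{\sqrt{E_0^f(n)}}, \qquad \bar{e}_0(n) = \frac{y(n)}{\sqrt{E_0(n)}}$$

For m = 0 to M − 1
$$\bar{k}_m(n) = \sqrt{1 - |\bar{e}_m^f(n)|^2}\,\sqrt{1 - |\bar{e}_m^b(n-1)|^2}\;\bar{k}_m(n-1) + \bar{e}_m^{f*}(n)\,\bar{e}_m^b(n-1)$$
$$\bar{e}_{m+1}^f(n) = \left(\sqrt{1 - |\bar{e}_m^b(n-1)|^2}\,\sqrt{1 - |\bar{k}_m(n)|^2}\right)^{-1}\left[\bar{e}_m^f(n) - \bar{k}_m^*(n)\,\bar{e}_m^b(n-1)\right]$$
$$\bar{e}_{m+1}^b(n) = \left(\sqrt{1 - |\bar{e}_m^f(n)|^2}\,\sqrt{1 - |\bar{k}_m(n)|^2}\right)^{-1}\left[\bar{e}_m^b(n-1) - \bar{k}_m(n)\,\bar{e}_m^f(n)\right]$$
$$\bar{k}_m^c(n) = \sqrt{1 - |\bar{e}_m(n)|^2}\,\sqrt{1 - |\bar{e}_m^b(n)|^2}\;\bar{k}_m^c(n-1) + \bar{e}_m^*(n)\,\bar{e}_m^b(n)$$
$$\bar{e}_{m+1}(n) = \left(\sqrt{1 - |\bar{e}_m^b(n)|^2}\,\sqrt{1 - |\bar{k}_m^c(n)|^2}\right)^{-1}\left[\bar{e}_m(n) - \bar{k}_m^{c*}(n)\,\bar{e}_m^b(n)\right]$$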


(c) Write a MATLAB function to implement the derived algorithm, and test its validity by using 
the equalization experiment in Example 10.5.2. 


10.45 Prove (10.6.46) by direct manipulation of (10.6.35). 


10.46 Derive the formulas for the QR-RLS lattice predictor (see Table 10.18), using the approach
introduced in Section 10.6.3 (Yang and Böhme 1992).


10.47 Demonstrate how the systolic array in Figure 10.55, which is an extension of the systolic
array structure shown in Figure 10.36, can be used to determine the LS error e(n) and the LS
coefficient vector c(n). Determine the functions to be assigned to the dotted-line computing
elements and the inputs with which they should be supplied.


FIGURE 10.55
Systolic array implementation of the extended QR-RLS algorithm.


10.48 The implementation of adaptive filters using multiprocessing involves the following steps: (1)
partitioning of the overall computational job into individual tasks, (2) allocation of compu- 
tational and communications tasks to the processors, and (3) synchronization and control of 
the processors. Figure 10.56 shows a cascade multiprocessing architecture used for adaptive 
filtering. To avoid latency (i.e., a delay between the filter’s input and output that is larger than 
the sampling interval), each processor should complete its task in time less than the sampling 
period and use results computed by the preceding processor and the scalar computational unit 
at the previous sampling interval. This is accomplished by the unit delays inserted between the 
processors. 


(a) Explain why the fast Kalman algorithm, given in Table 10.13, does not satisfy the multi- 
processing requirements. 
(b) Prove the formulas
$$\mathbf{b}(n) = \frac{\mathbf{b}(n-1) - \mathbf{g}_{M+1}^{\lceil M\rceil}(n)\,e^{b*}(n)}{1 - g_{M+1}^{(M+1)}(n)\,e^{b*}(n)} \qquad (k)$$
$$\mathbf{g}(n) = \mathbf{g}_{M+1}^{\lceil M\rceil}(n) - g_{M+1}^{(M+1)}(n)\,\mathbf{b}(n) \qquad (l)$$
and show that they can be used to replace formulas (g) and (h) in Table 10.13.

(c) Rearrange the formulas in Table 10.13 as follows: (e), (k), (l), (a), (b), (c), (d), (f).
Replace n by n − 1 in (e), (k), and (l). Show that the resulting algorithm complies with
the multiprocessing architecture shown in Figure 10.56.

(d) Draw a block diagram of a single multiprocessing section that can be used in the mul- 
tiprocessing architecture shown in Figure 10.56. Each processor in Figure 10.56 can be 
assigned to execute one or more of the designed sections. Note: You may find useful the 
discussions in Lawrence and Tewksbury (1983) and in Manolakis and Patel (1992). 

(e) Figure 10.57 shows an alternative implementation of a multiprocessing section that can 
be used in the architecture of Figure 10.56. Identify the input-output quantities and the 
various multiplier factors. 


FIGURE 10.56
Cascade multiprocessing architecture for the implementation of FIR adaptive filters.


10.49 Show that the LMS algorithm in Table 10.13 satisfies the multiprocessing architecture in
Figure 10.56.


10.50 Show that the a priori RLS linear prediction lattice (i.e., without the ladder part) algorithm
with error feedback complies with the multiprocessing architecture of Figure 10.56. Explain
why the addition of the ladder part violates the multiprocessing architecture. Can we rectify
these violations? (See Lawrence and Tewksbury 1983.)


FIGURE 10.57
Section for the multiprocessing implementation of the fast Kalman algorithm.


10.51 The fixed-length sliding window RLS algorithm is given in (10.8.4) through (10.8.10). 


(a) Derive the above equations of this algorithm (see Manolakis et al. 1987). 
(b) Develop a MATLAB function to implement the algorithm 


[c,e] = slwrls(x,y,L,delta,M,c0); 


where L is the fixed length of the window. 
(c) Generate 500 samples of the following nonstationary process
$$x(n) = \begin{cases} w(n) + 0.95x(n-1) - 0.9x(n-2) & 0 \le n < 200 \\ w(n) - 0.95x(n-1) - 0.9x(n-2) & 200 \le n < 300 \\ w(n) + 0.95x(n-1) - 0.9x(n-2) & n \ge 300 \end{cases}$$
where w(n) is a zero-mean, unit-variance white noise process. We want to obtain a second-
order linear predictor using adaptive algorithms. Use the sliding window RLS algorithm
on the data and choose L = 50 and 100. Obtain plots of the filter coefficients and mean
square error.

(d) Now use the growing memory RLS algorithm by choosing λ = 1. Compare your results
with the sliding-window RLS algorithm.

(e) Finally, use the exponentially growing memory RLS by choosing λ = (L − 1)/(L + 1)
that produces the same MSE. Compare your results.
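
The test signal of part (c) can be generated with a short sketch like the one below, written in Python for illustration. It assumes Gaussian white noise, zero initial conditions x(−1) = x(−2) = 0, and segment boundaries at n = 200 and n = 300. The function name `generate_process` and the seed are hypothetical.

```python
import random

def generate_process(N=500, seed=0):
    """Piecewise AR(2) process whose first coefficient flips sign for 200 <= n < 300."""
    rng = random.Random(seed)
    x = []
    for n in range(N):
        a1 = -0.95 if 200 <= n < 300 else 0.95     # time-varying coefficient
        x1 = x[n - 1] if n >= 1 else 0.0           # assumed zero initial conditions
        x2 = x[n - 2] if n >= 2 else 0.0
        x.append(rng.gauss(0.0, 1.0) + a1 * x1 - 0.9 * x2)
    return x

x = generate_process()
```

Both coefficient sets have poles of radius √0.9 ≈ 0.95, so each segment is stable. Only the pole angle changes, and that change is what the adaptive predictors are asked to track.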


10.52 Consider the definition of the MSD D(n) in (10.2.29) and that of the trace of a matrix (A.2.16).


(a) Show that D(n) = tr{Φ(n)}, where Φ(n) is the correlation matrix of $\tilde{\mathbf{c}}(n)$.
(b) For the evolution of the correlation matrix in (10.8.58), show that
$$D(\infty) \simeq \mu M\sigma_v^2 + \frac{\mathrm{tr}(\mathbf{R}^{-1}\mathbf{R}_\psi)}{4\mu}$$


10.53 Consider the analysis model given in Figure 10.42. Let the parameters of this model be as
follows:

c_o(n) model parameters:   c_o(0) = [0.9  −0.8]ᵀ    M = 2    ρ = 0.95
                           ψ(n) ~ WGN(0, R_ψ)       R_ψ = (0.01)² I
Signal x(n) parameters:    x(n) ~ WGN(0, R)         R = I
Noise v(n) parameters:     v(n) ~ WGN(0, σ_v²)      σ_v = 0.1

Simulate the system, using three values of μ that show slow, matched, and optimum adaptations
of the LMS algorithm.

(a) Obtain the tracking plots similar to Figure 10.43 for each of the above three adaptations. 

(b) Obtain the learning curve plots similar to Figure 10.44 for each of the above three adap- 
tations. 


10.54 Consider the analysis model given in Figure 10.42. Let the parameters of this model be as
follows

c_o(n) model parameters:   c_o(0) = [0.9  −0.8]ᵀ    M = 2    ρ = 0.95
                           ψ(n) ~ WGN(0, R_ψ)       R_ψ = (0.01)² I
Signal x(n) parameters:    x(n) ~ WGN(0, R)         R = I
Noise v(n) parameters:     v(n) ~ WGN(0, σ_v²)      σ_v = 0.1

Simulate the system, using three values of λ that show slow, matched, and optimum adaptations
of the RLS algorithm.


(a) Obtain the tracking plots similar to Figure 10.49 for each of the above three adaptations. 

(b) Obtain the learning curve plots similar to Figure 10.50 for each of the above three adap- 
tations. 

(c) Compare your results with those obtained in Problem 10.53. 


10.55 Consider the time-varying adaptive equalizer shown in Figure 10.58 in which the time variation
of the channel impulse response is given by
$$h(n) = \rho\,h(n-1) + \sqrt{1-\rho}\;\eta(n)$$
with ρ = 0.95, η(n) ~ WGN(0, √10), and h(0) = 0.5.
Let the equalizer be a single-tap equalizer and v(n) ~ WGN(0, 0.1).


(a) Simulate the system for three different adaptations; that is, choose μ for slow, matched,
and fast adaptations of the LMS algorithm.
(b) Repeat part (a), using the RLS algorithm.


FIGURE 10.58
Adaptive channel equalizer system with time-varying channel in Problem 10.55.
Array Processing


The subject of array processing is concerned with the extraction of information from signals 
collected using an array of sensors. These signals propagate spatially through a medium, 
for example, air or water, and the resulting wavefront is sampled by the sensor array. 
The information of interest in the signal may be either the content of the signal itself 
(communications) or the location of the source or reflection that produces the signal (radar 
and sonar). In either case, the sensor array data must be processed to extract this useful 
information. The methods utilized in most cases are extensions of the statistical and adaptive 
signal processing techniques discussed in previous chapters, such as spectral estimation and 
optimum and adaptive filtering, extended to sensor array applications. 

Sensor arrays are found in a wide range of applications, including radar, sonar, seis- 
mology, biomedicine, communications, astronomy, and imaging. Each of these individual 
fields contains a wealth of research into the various methods for the processing of array 
signals. Generally, the type of processing is dictated by the particular application. However, 
an underlying set of principles and techniques is common to a diverse set of applications. In 
this chapter, we focus on the fundamentals of array processing with emphasis on optimum 
and adaptive techniques. To simplify the discussion, we concentrate on linear arrays, where 
the sensors are located along a line. The extension of this material to other array config- 
urations is fairly straightforward in most cases. The intent of this chapter is to first give 
the uninitiated reader some exposure to the basic principles of array processing and then 
apply adaptive processing techniques to the array processing problem. For a more detailed 
treatment of array processing methods, see Monzingo and Miller (1980), Hudson (1981), 
Compton (1988), and Johnson and Dudgeon (1993). 

The chapter begins in Section 11.1 with a brief background in some array fundamen- 
tals, including spatially propagating signals, modulation and demodulation, and the array 
signal model. In Section 11.2, we introduce the concept of beamforming, that is, the spatial 
discrimination or filtering of signals collected with a sensor array. We look at conventional, 
that is, nonadaptive, beamforming and touch upon many of the common considerations for 
an array that affect its performance, for example, element spacing, resolution, and sidelobe 
levels. In Section 11.3, we look at the optimum beamformer, which is based on a priori 
knowledge of the data statistics. Within this framework, we discuss some of the specific as- 
pects of adaptive processing that affect performance in Section 11.4. Then, in Section 11.5, 
we discuss adaptive array processing methods that estimate the statistics from actual data, 
first block-adaptive and then sample-by-sample adaptive methods. Section 11.6 discusses 
other adaptive array processing techniques that were born out of practical considerations 
for various applications. The determination of the angle of arrival of a spatial signal is the 
topic of Section 11.7. In Section 11.8, we give a brief description of space-time adaptive 
processing. 
11.1 ARRAY FUNDAMENTALS


The information contained in a spatially propagating signal may be either the location of its 
source or the content of the signal itself. If we are interested in obtaining this information, 
we generally must deal with the presence of other, undesired signals. Much as a frequency- 
selective filter emphasizes signals at a certain frequency, we can choose to focus on signals 
from a particular direction. Clearly, this task can be accomplished by using a single sensor, 
provided that it has the ability to spatially discriminate; that is, it passes signals from certain 
directions while rejecting those from other directions. Such a single-sensor system, shown 
in Figure 11.1(a), is commonly found in communications and radar applications in which 
the signals are collected over a continuous spatial extent or aperture using a parabolic dish. 
The signals are reflected to the antenna in such a way that signals from the direction in which 
the dish is pointed are emphasized. The ability of a sensor to spatially discriminate, known as 
directivity, is governed by the shape and physical characteristics of its geometric structure. 
However, such a single-sensor system has several drawbacks. Since the sensor relies on 
mechanical pointing for directivity, it can extract and track signals from only one direction 
at a time; it cannot look in several directions simultaneously. Also, such a sensor cannot 
adapt its response, which would require physically changing the aperture, in order to reject 
potentially strong sources that may interfere with the extraction of the signals of interest. 


(a) Parabolic dish antenna (b) Sensor array antenna 
(continuous aperture) (discrete spatial aperture) 


FIGURE 11.1 

Comparison of a single, directive antenna with multiple sensors that make up an antenna 
array. In both cases, the response is designed to emphasize signals from a certain direction 
through spatial filtering, either continuous or discrete. 


An array of sensors has the ability to overcome these shortcomings of a single sensor. 
Figure 11.1(b) illustrates the use of a sensor array. The sensor array signals are combined
in such a way that a particular direction is emphasized. However, the direction in which the 


array is focused or pointed is almost independent of the orientation of the array. Therefore, 
the sensors can be combined in distinct, separate ways so as to emphasize different direc- 
tions, all of which may contain signals of interest. Since the various weighted summations 
of the sensors simply amount to processing the same data in different ways, these multiple 
sources can be extracted simultaneously. Also, arrays have the ability to adjust the overall 
rejection level in certain directions to overcome strong interference sources. In this section, 
we discuss some fundamentals of sensor arrays. First, we give a brief description of spatially 
propagating signals and the modulation and demodulation operations. Then we develop a 
signal model, first for an arbitrary array and then by simplifying to the case of a uniform 
linear array. In addition, we point out the interpretation of a sensor array as a mechanism 
for the spatial sampling of a spatially propagating signal. 


11.1.1 Spatial Signals 


In their most general form, spatial signals are signals that propagate through space. These 
signals originate from a source, travel through a propagation medium, say, air or water, and 
arrive at an array of sensors that spatially samples the waveform. A processor can then take 
the data collected by the sensor array and attempt to extract information about the source, 
based on certain characteristics of the propagating wave. Since space is three-dimensional, 
a spatial signal at a point specified by the vector $\mathbf{r}$ can be represented either in Cartesian 
coordinates $(x, y, z)$ or in spherical coordinates $(R, \phi_{\mathrm{az}}, \theta_{\mathrm{el}})$, as shown in Figure 11.2. Here, 
$R = \|\mathbf{r}\|$ represents range or the distance from the origin, and $\phi_{\mathrm{az}}$ and $\theta_{\mathrm{el}}$ are the azimuth 
and elevation angles, respectively. 


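FIGURE 11.2 
Three-dimensional space describing azimuth, elevation, and range. The figure labels the 
Cartesian components of $\mathbf{r}$ as $r_x = \|\mathbf{r}\| \sin\phi_{\mathrm{az}} \cos\theta_{\mathrm{el}}$, $r_y = \|\mathbf{r}\| \sin\theta_{\mathrm{el}}$, and 
$r_z = \|\mathbf{r}\| \cos\phi_{\mathrm{az}} \cos\theta_{\mathrm{el}}$. 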


The propagation of a spatial signal is governed by the solution to the wave equation. For 
electromagnetic propagating signals, the wave equation can be deduced from Maxwell’s 
equations (Ishimaru 1990), while for sound waves the solution is governed by the basic laws 
of acoustics (Kino 1987; Jensen et al. 1994). However, in either case, for a propagating wave 
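emanating from a source located at $\mathbf{r}_0$, one solution is a single-frequency wave given by 

$$s(t, \mathbf{r}) = \frac{A}{\|\mathbf{r} - \mathbf{r}_0\|^2}\, e^{j2\pi F_c \left(t - \frac{\|\mathbf{r} - \mathbf{r}_0\|}{c}\right)} \tag{11.1.1}$$
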
where $A$ is the complex amplitude, $F_c$ is the carrier frequency of the wave, and $c$ is the 
speed of propagation of the wave. The speed of propagation is determined by the type 
of wave (electromagnetic or acoustic) and the propagation medium. For the purposes of 
this discussion, we ignore the singularity at the source (origin); that is, $s(t, \mathbf{r}_0) = \infty$. This 
equation suppresses the dependencies on $\phi_{\mathrm{az}}$ and $\theta_{\mathrm{el}}$ since the wave propagates radially from 
the source. At any point in space, the wave has the temporal frequency $F_c$. In (11.1.1) and for 
the remainder of this chapter, we will assume a lossless, nondispersive propagation medium, 
that is, a medium that does not attenuate the propagating signal further than predicted by 
the wave equation, and the propagation speed is uniform so that the wave travels according 
to (11.1.1). A dispersive medium adds a frequency dependence to the wave propagation 
(Jensen et al. 1994). Clearly, the signal travels in time where the spatial propagation is 
determined by the direct coupling between space and time in order to satisfy (11.1.1). We 
can then define the wavelength of the propagating wave as 


$$\lambda = \frac{c}{F_c} \tag{11.1.2}$$


which is the distance traversed by the wave during one temporal period. 

Two other simplifying assumptions will be made for the remainder of this chapter. 
First, the propagating signals are assumed to be produced by a point source; that is, the size 
of the source is small with respect to the distance between the source and the sensors that 
measure the signal. Second, the source is assumed to be in the “far field,” i.e., at a large 
distance from the sensor array, so that the spherically propagating wave can be reasonably 
approximated with a plane wave. This approximation again requires the source to be far 
removed from the array so that the curvature of the wave across the array is negligible. This 
concept is illustrated in Figure 11.3. Multiple sources are treated through superposition of 
the various spatial signals at the sensor array. Although each individual wave radiates from 
its source, generally the origin (r = 0) is reserved for the position of the sensor array since 
this is the point in space at which the collection of waves is measured. For more details on 
spatially propagating signals, see Johnson and Dudgeon (1993). 


FIGURE 11.3 
Plane wave approximation in the far field of the source. The diagram marks the source, the 
near field, and the far field. 


Let us now consider placing a linear array in three-dimensional space in order to sense 
the propagating waves. The array consists of a series of elements located on a line with 
uniform spacing. Such an array is known as a uniform linear array (ULA). For convenience, 
we choose the coordinate system for our three-dimensional space as in Figure 11.2 such that 
the ULA lies on the x axis. In addition, we have a wave originating from a point r in this 
three-dimensional space that is located in the far field of the array such that the propagating 
signal can be approximated by a plane wave at the ULA. The plane wave impinges on the 
ULA as illustrated in Figure 11.4. As we will see, the differences in distance between the 
sensors determine the relative delays in arrival of the plane wave. The point from which 
the wave originates can be described by its distance from the origin $\|\mathbf{r}\|$ and its azimuth 
and elevation angles $\phi_{\mathrm{az}}$ and $\theta_{\mathrm{el}}$, respectively. If the distance between elements of the ULA 
is $d$, then the difference in propagation distance between neighboring elements for a plane 
wave arriving from an azimuth $\phi_{\mathrm{az}}$ and elevation $\theta_{\mathrm{el}}$ is 

$$d_x = \|\mathbf{r}\| \sin\phi_{\mathrm{az}} \cos\theta_{\mathrm{el}} \tag{11.1.3}$$

FIGURE 11.4 
Cone angle ambiguity surface for a uniform linear array. 


These differences in the propagation distance that the plane wave must travel to each of 
the sensors are a function of a general angle of arrival with respect to the ULA, $\phi$. If we 
consider the entire three-dimensional space, we note that equivalent delays are produced 
by any signal arriving from a cone about the ULA. Therefore, any signal arriving at the 
ULA on this surface has the same set of relative delays between the elements. This conical 
ambiguity surface is illustrated in Figure 11.4. For this reason, the angle of incidence to a 
linear array is commonly referred to as the cone angle, $\phi_{\mathrm{cone}}$. We see that the cone angle is 
related to the physical angles, azimuth and elevation defined in Figure 11.4, by 

$$\sin\phi = \sin\phi_{\mathrm{az}} \cos\theta_{\mathrm{el}} \tag{11.1.4}$$

where $\phi = 90^\circ - \phi_{\mathrm{cone}}$. In this manner, we can take a given azimuth and elevation pair 
and determine their corresponding cone angle. For the remainder of this chapter, we use the 
terms angle of arrival and simply angle interchangeably. 
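
As a brief illustration of (11.1.4), the following Python sketch converts an azimuth and 
elevation pair to the angle of arrival and the cone angle. The function name cone_angle_deg 
and the example angles are hypothetical choices made for this illustration only. 

```python
import math


def cone_angle_deg(az_deg, el_deg):
    """Return (phi, phi_cone) in degrees for a ULA lying on the x axis.

    Uses sin(phi) = sin(az) * cos(el) from (11.1.4) and phi = 90 - phi_cone.
    """
    s = math.sin(math.radians(az_deg)) * math.cos(math.radians(el_deg))
    s = max(-1.0, min(1.0, s))  # guard against rounding slightly outside [-1, 1]
    phi = math.degrees(math.asin(s))
    return phi, 90.0 - phi


# Two different physical directions (hypothetical values) on the same cone.
phi_a, cone_a = cone_angle_deg(30.0, 0.0)
phi_b, cone_b = cone_angle_deg(45.0, 45.0)
print(phi_a, cone_a)
print(phi_b, cone_b)
```

Both directions satisfy $\sin\phi = 0.5$, so they lie on the same conical ambiguity surface and 
produce the same relative delays across the ULA. 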


11.1.2 Modulation-Demodulation 


The spatial propagation of signals was described by (11.1.1) using a propagation speed $c$ 
and a center frequency $F_c$. For a general class of signals, the signal of interest $s_0(t)$ has a 
bandwidth that is a small fraction of the center frequency and is modulated up to the center 
frequency. Since the propagating wave then “carries” certain information to the receiving 
point in the form of a temporal signal, $F_c$ is commonly referred to as the carrier frequency. 
The process of generating the signal $\tilde{s}_0(t)$ from $s_0(t)$ in order to transmit this information 
is accomplished by mixing the signal $s_0(t)$ with the carrier waveform $\cos 2\pi F_c t$ in an 
operation known as modulation. The propagating signal is then produced by a high-gain 
transmitter. The signal travels through space until it arrives at a sensor that measures the 
signal. Let us denote the received propagating signal as 

$$\tilde{s}_0(t) = s_0(t) \cos 2\pi F_c t = \tfrac{1}{2}\, s_0(t) \left(e^{j2\pi F_c t} + e^{-j2\pi F_c t}\right) \tag{11.1.5}$$



where we say that the signal $s_0(t)$ is carried by the propagating waveform $\cos 2\pi F_c t$. The 
spectrum of $\tilde{s}_0(t)$ is made up of two components: the spectrum of the signal $s_0(t)$ shifted 
to $F_c$, and the same spectrum shifted to $-F_c$ and reflected about $-F_c$. This spectrum 
$\tilde{S}_0(F)$ is shown in Figure 11.5. Here we indicate the signal $s_0(t)$ has a bandwidth $B$. The 
baseband signal $s_0(t)$, although originating as a real signal prior to modulation, has a 
nonsymmetric spectrum due to the asymmetric† spectral response of the propagation medium 
about frequency $F_c$. The received signal $\tilde{s}_0(t)$, though, is real-valued; that is, its spectrum 
exhibits even symmetry about $F = 0$. This fact is consistent with actual, physical signals 
that are real-valued as they are received and measured by a sensor. 

FIGURE 11.5 
Spectrum of a bandpass signal. 

The reception of spatially propagating signals with a sensor is only the beginning of 
the process of forming digital samples for both the in-phase and quadrature components of 
the sensor signal. Upon reception of the signal $\tilde{s}_0(t)$, the signal is mixed back to baseband 
in an operation known as demodulation. Included in the $m$th sensor signal is thermal noise 
due to the electronics of the sensor, $\tilde{w}_m(t)$: 

$$\tilde{x}_m(t) = \tilde{s}_0(t) * h_m(t, \phi_s) + \tilde{w}_m(t) \tag{11.1.6}$$

where $h_m(t, \phi_s)$ is the combined temporal and spatial impulse response of the $m$th sensor. 
The angle $\phi_s$ is the direction from which $\tilde{s}_0(t)$ was received. In the case of an omnidirectional 
sensor with an equal response in all directions, the impulse response no longer is dependent 
on the angle of the signal. The demodulation process involves multiplying the received 
signal by $\cos 2\pi F_c t$ and $-\sin 2\pi F_c t$ to form both the in-phase and quadrature channels, 
respectively. Note the quadrature component is 90° out of phase with the in-phase component. 
The entire process is illustrated in Figure 11.6 for the mth sensor. This structure is referred 
to as the receiver of the mth channel. 

Following demodulation, the signals in each channel are passed through a low-pass 
filter to remove any high-frequency components. The cutoff frequency of this low-pass filter 
determines the bandwidth of the receiver. Throughout this chapter, we assume a perfect or 
ideal low-pass filter, that is, a response of 1 in the passband and 0 in the stopband. In practice, 
the characteristics of the actual, nonideal low-pass filter can impact the performance of the 
resulting processor. Following the low-pass filtering operation, the signals in both the in- 
phase and quadrature channels are critically (Nyquist) sampled at the receiver bandwidth B. 
Oversampling at greater than the receiver bandwidth is also possible but is not considered 
here. More details on the signals at the various stages of the receiver, including the sensor 
impulse response, are covered in the next section on the array signal model. The output of 
the receiver is a complex-valued, discrete-time signal for the mth sensor with the in-phase 
and quadrature channels generating the real and imaginary portions of the signal 


$$x_m(n) = x_m^{(I)}(n) + j\, x_m^{(Q)}(n) \tag{11.1.7}$$

For more details on the complex representation of bandpass signals, sampling, and the 
modulation and demodulation process, see Section 2.1. We should also mention that the 
sampling process in many systems is implemented using a technique commonly referred 
to as digital in-phase/quadrature or simply digital IQ (Rader 1984), rather than the more 
classical method outlined in this section. The method is more efficient as it only requires 
a single analog-to-digital (A/D) converter, though at a higher sampling rate.‡ See Rader 
(1984) for more details. 

† The asymmetry can arise from dispersive effects in the transmission medium. 

FIGURE 11.6 
Block diagram of propagating signal arriving at a sensor with a receiver. 
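
The classical demodulation path described in this section can be sketched numerically. The 
Python fragment below is a minimal illustration under simplifying assumptions that are ours, 
not part of the receiver design above: the received signal is a noise-free carrier with 
constant amplitude and phase, and the ideal low-pass filter is replaced by an average over a 
whole number of carrier periods. The function name iq_demodulate and all numerical values 
are hypothetical. 

```python
import cmath
import math


def iq_demodulate(x, fc, fs):
    """Mix real samples x to baseband and average (a crude low-pass filter).

    Returns one complex value whose real part is the in-phase channel and
    whose imaginary part is the quadrature channel.
    """
    i_sum = 0.0
    q_sum = 0.0
    for n, xn in enumerate(x):
        t = n / fs
        i_sum += xn * math.cos(2 * math.pi * fc * t)    # in-phase mixer
        q_sum += xn * -math.sin(2 * math.pi * fc * t)   # quadrature mixer
    return complex(i_sum / len(x), q_sum / len(x))


# Hypothetical passband signal: amplitude 2, phase 60 degrees, carrier at 1 kHz.
fc, fs = 1000.0, 16000.0
amp, theta = 2.0, math.pi / 3
num_samples = 160  # exactly 10 carrier periods at 16 samples per period
x = [amp * math.cos(2 * math.pi * fc * n / fs + theta) for n in range(num_samples)]

envelope = iq_demodulate(x, fc, fs)
print(envelope)
```

The averaging removes the components at twice the carrier frequency, leaving the complex 
value $(A/2)e^{j\theta}$; the factor of one-half is the same one that appears in (11.1.5). 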


11.1.3 Array Signal Model 


We begin by developing a model for a single spatial signal in noise received by a ULA. 
Consider a signal received by the ULA from an angle $\phi_s$ as in Figure 11.6. Each sensor 
receives the spatially propagating signal and converts its measured energy to voltage. This 
voltage signal is then part of the receiver channel from Figure 11.6. In addition, the receiver 
contains noise due to internal electronics known as thermal noise.§ Recall from (11.1.6) that 
$\tilde{x}_m(t)$ is the continuous-time signal in the $m$th sensor containing both the received carrier-modulated 
signals and thermal noise. The signal $x_m(t)$ is then obtained by demodulating 
$\tilde{x}_m(t)$ to baseband and low-pass filtering to the receiver bandwidth, while $x_m(n)$ is its 
discrete-time counterpart. Since the model is assumed to be linear, the extension to multiple 
signals, including interference sources, is straightforward. 

The discrete-time signals from a ULA may be written as a vector containing the indi- 
vidual sensor signals 


$$\mathbf{x}(n) = [x_1(n) \;\; x_2(n) \;\; \cdots \;\; x_M(n)]^T \tag{11.1.8}$$

where $M$ is the total number of sensors. A single observation or measurement of this signal 
vector is known as an array snapshot. We begin by examining a single, carrier-modulated 
signal $\tilde{s}_0(t) = s_0(t) \cos 2\pi F_c t$ arriving from angle $\phi_s$ that is received by the $m$th sensor. We 
assume that the signal $s_0(t)$ has a deterministic amplitude and random, uniformly distributed 
phase. The tilde symbol is used to indicate that the signal is a passband or carrier-modulated 
signal. Here $s_0(t)$ is the baseband signal, and $F_c$ is the carrier frequency. This signal is 
received by the $m$th sensor with a delay $\tau_m$: 

$$\tilde{x}_m(t) = h_m(t, \phi_s) * \tilde{s}_0(t - \tau_m) + \tilde{w}_m(t) \tag{11.1.9}$$



where $h_m(t, \phi_s)$ is the impulse response of the $m$th sensor as a function of both time and angle 
$\phi_s$, and $\tilde{w}_m(t)$ is the sensor noise. Note that the relative delay at the $m$th sensor $\tau_m$ is also a 
function of $\phi_s$. We have temporarily suppressed this dependence to simplify the notation. 
Usually, we set $\tau_1 = 0$, in which case the delays to the remaining sensors ($m = 2, 3, \ldots, M$) 
are simply the differences in propagation time of $\tilde{s}_0(t)$ to these sensors with respect to the 
first sensor. The sensor signal can also be expressed in the frequency domain as 

$$\begin{aligned}
\tilde{X}_m(F) &= H_m(F, \phi_s)\, \tilde{S}_0(F)\, e^{-j2\pi F \tau_m} + \tilde{W}_m(F) \\
&= H_m(F, \phi_s) \left[S_0(F - F_c) + S_0^*(-F - F_c)\right] e^{-j2\pi F \tau_m} + \tilde{W}_m(F)
\end{aligned} \tag{11.1.10}$$

by using (11.1.5) and taking the Fourier transform of (11.1.9). Following demodulation 
and ideal low-pass filtering of the signal from the $m$th sensor, as shown in Figure 11.6, the 
spectrum of the signal is 

$$X_m(F) = H_m(F + F_c, \phi_s)\, S_0(F)\, e^{-j2\pi (F + F_c) \tau_m} + W_m(F) \tag{11.1.11}$$

where $X_m(F) = X_m^{(I)}(F) + j X_m^{(Q)}(F)$. The second term $S_0^*(-F - 2F_c)$ has been removed 
through the ideal low-pass filtering operation. This ideal low-pass filter has a value of unity 
across its passband so that $W_m(F) = \tilde{W}_m(F + F_c)$ for $|F| < B/2$. 

‡ This digital IQ technique is very important for adaptive processing as I/Q channel mismatch 
can limit performance. One A/D converter avoids this source of mismatch. 

§ Another source of noise may be external background noise. Many times this is assumed to 
be isotropic so that the overall noise signal is uncorrelated from sensor to sensor. 

We now make a critical, simplifying assumption: The bandwidth of $s_0(t)$ is small 
compared to the carrier frequency; this is known as the narrowband assumption. This 
assumption allows us to approximate the propagation delays of a particular signal between 
sensor elements with a phase shift. There are numerous variations on this assumption, but in 
general it holds for cases in which the signal bandwidth is less than some small percentage of 
the carrier frequency, say, less than 1 percent. The ratio of the signal bandwidth to the carrier 
frequency is referred to as the fractional bandwidth. However, the fractional bandwidth for 
which the narrowband assumption holds is strongly dependent on the length of the array and 
the strength of the received signals. Thus, we might want to consider the time-bandwidth 
product (TBWP), which is the product of the signal bandwidth and the maximum amount of 
time for a spatial signal to propagate across the entire array ($\phi_s = \pm 90^\circ$). If TBWP $\ll 1$, 
then the narrowband assumption is valid. The effects of bandwidth on performance are 
treated in Section 11.4.2. 
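
To make the two narrowband measures concrete, the short Python sketch below evaluates the 
fractional bandwidth and the time-bandwidth product for one array. It assumes that the TBWP 
is taken as the signal bandwidth multiplied by the time a wave needs to cross the full length 
of a ULA at $\phi_s = \pm 90^\circ$. The function name narrowband_metrics and the radar-like 
numbers are hypothetical. 

```python
def narrowband_metrics(bandwidth_hz, carrier_hz, num_sensors, spacing_m, speed_mps):
    """Return (fractional bandwidth, time-bandwidth product) for a ULA."""
    fractional_bw = bandwidth_hz / carrier_hz
    # Longest transit time: the wave travels along the array (phi = +/-90 degrees).
    max_transit_s = (num_sensors - 1) * spacing_m / speed_mps
    return fractional_bw, bandwidth_hz * max_transit_s


# Hypothetical example: 16 sensors at half-wavelength spacing, 1-GHz carrier,
# 1-MHz signal bandwidth, electromagnetic propagation.
c = 3.0e8
fc = 1.0e9
wavelength = c / fc
frac_bw, tbwp = narrowband_metrics(1.0e6, fc, 16, wavelength / 2, c)
print(frac_bw, tbwp)
```

In this example both measures are far below 1, so the delays across the array can 
reasonably be modeled as phase shifts. 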

In addition to the narrowband assumption, we assume that the response of the sensor 
is constant across the bandwidth of the receiver, that is, $H_m(F + F_c, \phi_s) = H_m(F_c, \phi_s)$ for 
$|F| < B/2$. Thus, the spectrum in (11.1.11) simplifies to 

$$X_m(F) = H_m(F_c, \phi_s)\, S_0(F)\, e^{-j2\pi F_c \tau_m} + W_m(F) \tag{11.1.12}$$

and the discrete-time signal model is obtained by sampling the inverse Fourier transform 
of (11.1.12) 

$$x_m(n) = H_m(F_c, \phi_s)\, s_0(n)\, e^{-j2\pi F_c \tau_m} + w_m(n) \tag{11.1.13}$$

The term $w_m(n)$ corresponds to $W_m(F)$, the sensor thermal noise across the bandwidth of 
the receiver of the $m$th sensor. Furthermore, we assume that the power spectral density of this 
noise is flat across this bandwidth; that is, the discrete-time noise samples are uncorrelated. 
Also, the thermal noise in all the sensors is mutually uncorrelated.† If we further assume 
that each of the sensors in the array has an equal, omnidirectional response at frequency $F_c$, 
that is, $H_m(F_c, \phi_s) = H(F_c) = \text{constant}$, for $1 \le m \le M$, then the constant sensor 
responses can be absorbed into the signal term‡ 

$$s(n) = H(F_c)\, s_0(n) \tag{11.1.14}$$


† In actual systems, thermal noise samples are temporally correlated through the use of antialiasing filters prior to 
digital sampling. In addition, the thermal noise between sensors may be correlated due to mutual coupling of the 
sensors. 

‡ In many systems, we can compensate for differences in responses by processing signals from the sensors in an 
attempt to make their responses as similar as possible. When the data from the sensors are used to perform this 
compensation, the process is known as adaptive channel matching. 


For the remainder of the chapter, we use the signal s(n) as defined in (11.1.14). Using 
(11.1.8) and (11.1.13), we can then write the full-array discrete-time signal model as 


$$\mathbf{x}(n) = \sqrt{M}\, \mathbf{v}(\phi_s)\, s(n) + \mathbf{w}(n) \tag{11.1.15}$$

where 

$$\mathbf{v}(\phi) = \frac{1}{\sqrt{M}} \left[1 \;\; e^{-j2\pi F_c \tau_2(\phi)} \;\; \cdots \;\; e^{-j2\pi F_c \tau_M(\phi)}\right]^T \tag{11.1.16}$$

is the array response vector. We have chosen to measure all delays relative to the first 
sensor [$\tau_1(\phi) = 0$] and are now indicating the dependence of these delays on $\phi$. We use the 
normalization of $1/\sqrt{M}$ for mathematical convenience so that the array response vector has 
unit norm, that is, $\|\mathbf{v}(\phi)\|^2 = \mathbf{v}^H(\phi)\mathbf{v}(\phi) = 1$. The factor is compensated for with the $\sqrt{M}$ 
term in (11.1.15). The assumption of equal, omnidirectional sensor responses is necessary 
to simplify the analysis but should always be kept in mind when considering experimentally 
collected data for which this assumption certainly will not hold exactly. The other critical 
assumption made is that we have perfect knowledge of the array sensor locations, which 
also must be called into question for actual sensors and the data collected with them. 

Up to this point, we have not made any assumptions about the form of the array, so that 
the array signal model we have developed holds for arbitrary arrays. Now we wish to focus 
our attention on the ULA, which is an array that has all its elements on a line with equal spac- 
ing between the elements. The ULA is shown in Figure 11.7, and the interelement spacing 
is denoted by $d$. Consider the single propagating signal that impinges on the ULA from an 
angle $\phi$. Since all the elements are equally spaced, the spatial signal has a difference in 
propagation paths between any two successive sensors of $d \sin\phi$ that results in a time delay of 

$$\tau(\phi) = \frac{d \sin\phi}{c} \tag{11.1.17}$$

where $c$ is the rate of propagation of the signal. As a result, the delay to the $m$th element 
with respect to the first element in the array is 

$$\tau_m(\phi) = (m - 1)\, \frac{d \sin\phi}{c} \tag{11.1.18}$$

and substituting into (11.1.16), we see the array response vector for a ULA is 

$$\mathbf{v}(\phi) = \frac{1}{\sqrt{M}} \left[1 \;\; e^{-j2\pi (d \sin\phi)/\lambda} \;\; \cdots \;\; e^{-j2\pi (M-1)(d \sin\phi)/\lambda}\right]^T \tag{11.1.19}$$

since $F_c = c/\lambda$. 
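
The ULA array response vector in (11.1.19) is simple to evaluate numerically. The following 
Python sketch is one possible implementation that uses only the standard library; the 
function name ula_response and the argument d_over_lambda (the spacing expressed in 
wavelengths) are hypothetical names introduced for this illustration. 

```python
import cmath
import math


def ula_response(num_sensors, phi_deg, d_over_lambda=0.5):
    """Unit-norm ULA array response vector, following (11.1.19).

    Element m (counting from 0) carries the phase -2*pi*m*(d/lambda)*sin(phi),
    so all delays are measured relative to the first sensor.
    """
    phase_step = 2 * math.pi * d_over_lambda * math.sin(math.radians(phi_deg))
    scale = 1 / math.sqrt(num_sensors)
    return [scale * cmath.exp(-1j * phase_step * m) for m in range(num_sensors)]


# Hypothetical example: 4 sensors, half-wavelength spacing, signal from 30 degrees.
v = ula_response(4, 30.0)
norm_sq = sum(abs(vm) ** 2 for vm in v)
print(v)
print(norm_sq)
```

For this choice of angle and spacing the phase advances by 90° from one sensor to the next, 
and the squared norm equals 1 because of the $1/\sqrt{M}$ normalization. 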


FIGURE 11.7 
Plane wave impinging on a uniform linear array. 


11.1.4 The Sensor Array: Spatial Sampling 


In general, we can think of a sensor array as a mechanism for spatially sampling wavefronts 
propagating at a certain operating (carrier) frequency. Since in most instances the user 
either controls or has knowledge of the operating frequency, the sensor array provides a 
reliable means of interrogating the incoming wavefront for information. Similar to temporal 
sampling, the sensor array provides discrete (spatially sampled) data that can be used without 
loss of information, provided certain conditions are met. Namely, the sampling frequency 
must be high enough so as not to create spatial ambiguities or, in other words, to avoid 
spatial aliasing. The advantages of discrete-time processing and digital filtering have been 
well documented (Oppenheim and Schafer 1989; Proakis and Manolakis 1996). In the case 
of the spatial processing of signals, spatial sampling using an array provides the capability 
to change the characteristics of a discrete spatial filter, which is not possible for a continuous 
spatial aperture. 

An arbitrary array performs its sampling in multiple dimensions and along a nonuniform 
grid so that it is difficult to compare to discrete-time sampling. However, a ULA has a direct 
correspondence to uniform, regular temporal sampling, since it samples uniformly in space 
on a linear axis. Thus, for a ULA we can talk about a spatial sampling frequency $U_s$ defined 
by 

$$U_s \triangleq \frac{1}{d} \tag{11.1.20}$$

where the spatial sampling period is determined by the interelement spacing $d$, and $U_s$ is 
measured in cycles per unit of length (for example, cycles per meter). Recall from (11.1.19) 
that the measurements made with a ULA on a narrowband signal correspond to a phase 
progression across the sensors determined by the angle of the incoming signal. As with 
temporal signals, the phase progression for uniform sampling is a consequence of the 
frequency; that is, consecutive samples of the same signal differ only by a phase shift of 
$e^{j2\pi F}$, where $F$ is the frequency. In the case of a spatially propagating signal, this 
frequency is given by 

$$U \triangleq \frac{\sin\phi}{\lambda} \tag{11.1.21}$$

which can be thought of as the spatial frequency. The normalized spatial frequency is then 
defined by 

$$u \triangleq \frac{U}{U_s} = \frac{d \sin\phi}{\lambda} \tag{11.1.22}$$

Therefore, we can rewrite the array response vector from (11.1.19) in terms of the normalized 
spatial frequency as 

$$\mathbf{v}(\phi) = \mathbf{v}(u) = \frac{1}{\sqrt{M}} \left[1 \;\; e^{-j2\pi u} \;\; \cdots \;\; e^{-j2\pi (M-1) u}\right]^T \tag{11.1.23}$$

which we note is simply a Vandermonde vector (Strang 1998), that is, a vector whose 
elements are successive integer powers of the same number, in this case $e^{-j2\pi u}$. 

The interelement spacing $d$ is simply the spatial sampling interval, which is the inverse 
of the sampling frequency. Therefore, similar to Shannon’s theorem for discrete-time 
sampling, there are certain requirements on the spatial sampling frequency to avoid aliasing. 
Since normalized frequencies are unambiguous for $-\frac{1}{2} \le u < \frac{1}{2}$ and the full range of 
possible unambiguous angles is $-90^\circ \le \phi \le 90^\circ$, the sensor spacing must be 

$$d \le \frac{\lambda}{2} \tag{11.1.24}$$

to prevent spatial ambiguities. Since lowering the array spacing below this upper limit 
only provides redundant information and directly conflicts with the desire to have as much 
aperture as possible for a fixed number of sensors, we generally set $d = \lambda/2$. This tradeoff 
is further explored using beampatterns in the next section. 
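
The spacing requirement in (11.1.24) can be checked with a small numerical experiment. The 
Python sketch below builds the array response vector of (11.1.23) for two mirror-image 
angles and compares them for two spacings. The helper names and the choice of $M = 8$ 
sensors and $\pm 30^\circ$ are hypothetical and serve only as an illustration. 

```python
import cmath
import math


def normalized_spatial_frequency(phi_deg, d_over_lambda):
    """u = (d / lambda) * sin(phi), as in (11.1.22)."""
    return d_over_lambda * math.sin(math.radians(phi_deg))


def response_from_u(num_sensors, u):
    """Unit-norm Vandermonde response vector of (11.1.23)."""
    scale = 1 / math.sqrt(num_sensors)
    return [scale * cmath.exp(-2j * math.pi * m * u) for m in range(num_sensors)]


def max_difference(num_sensors, phi1_deg, phi2_deg, d_over_lambda):
    """Largest element-wise difference between two response vectors."""
    v1 = response_from_u(num_sensors, normalized_spatial_frequency(phi1_deg, d_over_lambda))
    v2 = response_from_u(num_sensors, normalized_spatial_frequency(phi2_deg, d_over_lambda))
    return max(abs(a - b) for a, b in zip(v1, v2))


M = 8
diff_full_wavelength = max_difference(M, 30.0, -30.0, 1.0)  # d = lambda
diff_half_wavelength = max_difference(M, 30.0, -30.0, 0.5)  # d = lambda / 2
print(diff_full_wavelength, diff_half_wavelength)
```

With $d = \lambda$ the two angles map to $u = \pm\frac{1}{2}$, which differ by a full cycle, so the 
array cannot tell them apart. With $d = \lambda/2$ they map to $u = \pm\frac{1}{4}$ and the response 
vectors are clearly different. 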


11.2 CONVENTIONAL SPATIAL FILTERING: BEAMFORMING 


In many applications, the desired information to be extracted from an array of sensors is 
the content of a spatially propagating signal from a certain direction. The content may be 
a message contained in the signal, such as in communications applications, or merely the 
existence of the signal, as in radar and sonar. To this end, we want to linearly combine the 
signals from all the sensors in a manner, that is, with a certain weighting, so as to examine 
signals arriving from a specific angle. This operation, shown in Figure 11.8, is known as 
beamforming because the weighting process emphasizes signals from a particular direction 
while attenuating those from other directions and can be thought of as casting or forming a 
beam. In this sense, a beamformer is a spatial filter; and in the case of a ULA, it has a direct 
analogy to an FIR frequency-selective filter for temporal signals, as discussed in Section 
1.5.1. Beamforming is commonly referred to as “electronic” steering since the weights are 
applied using electronic circuitry following the reception of the signal for the purpose of 
steering the array in a particular direction.† This can be contrasted with mechanical steering, 
in which the antenna is physically pointed in the direction of interest. For a complete tutorial 
on beamforming see Van Veen and Buckley (1988, 1998). 


FIGURE 11.8 
Beamforming operation. The diagram shows the sensor signals $x_1(n)$ through $x_M(n)$ 
combined into the output $y(n)$. 


In its most general form, a beamformer produces its output by forming a weighted 
combination of signals from the M elements of the sensor array, that is, 


$$y(n) = \sum_{m=1}^{M} c_m^*\, x_m(n) = \mathbf{c}^H \mathbf{x}(n) \tag{11.2.1}$$

where 

$$\mathbf{c} = [c_1 \;\; c_2 \;\; \cdots \;\; c_M]^T \tag{11.2.2}$$


is the column vector of beamforming weights. The beamforming operation for an M element 
array is illustrated in Figure 11.8. 


† In general, performance does degrade as the angle to which the array is steered approaches $\phi = -90^\circ$ or 
$\phi = 90^\circ$. Although the array is optimized at broadside ($\phi = 0^\circ$), it certainly can steer over a wide range of angles 
about broadside for which performance degradation is minimal. 


Beam response 


A standard tool for analyzing the performance of a beamformer is the response for a 
given weight vector $\mathbf{c}$ as a function of angle $\phi$, known as the beam response. This angular 
response is computed by applying the beamformer $\mathbf{c}$ to a set of array response vectors from 
all possible angles, that is, $-90^\circ \le \phi < 90^\circ$, 

$$C(\phi) = \mathbf{c}^H \mathbf{v}(\phi) \tag{11.2.3}$$

Typically, in evaluating a beamformer, we look at the quantity $|C(\phi)|^2$, which is known as the 
beampattern. Alternatively, the beampattern can be computed as a function of normalized 
spatial frequency $u$ from (11.1.22). For a ULA with $\lambda/2$ element spacing, the beampattern 
as a function of $u$ can be efficiently computed using the FFT for $-\frac{1}{2} \le u < \frac{1}{2}$ at points 
separated by $1/N_{\mathrm{fft}}$, where $N_{\mathrm{fft}} \ge M$ is the FFT size. Thus, a beampattern can be computed 
in MATLAB with the command C=fftshift(fft(c,N_fft))/sqrt(M), where the FFT size 
is selected to display the desired level of detail. To compute the corresponding angles of 
the beampattern, we can simply convert spatial frequency to angle as 

$$\phi = \arcsin\left(\frac{\lambda}{d}\, u\right) \tag{11.2.4}$$

A sample beampattern for a 16-element uniform array with uniform weighting ($c_m = 
1/\sqrt{M}$) is shown in Figure 11.9, which is plotted on a logarithmic scale in decibels. The 
large mainlobe is centered at $\phi = 0^\circ$, the direction in which the array is steered. Also 
notice the unusual sidelobe structure created by the nonlinear relationship between angle 
and spatial frequency in (11.2.4) at angles away from broadside ($\phi = 0^\circ$). 

FIGURE 11.9 
A sample beampattern of a spatial matched filter for an $M = 16$ 
element ULA steered to $\phi = 0^\circ$. 
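
For readers without MATLAB, the beampattern can also be evaluated directly from (11.2.3). 
The Python sketch below is an illustration under our own assumptions: it uses a direct sum 
over the sensors in place of the FFT, assumes $\lambda/2$ spacing, and clips very small 
powers at $-120$ dB so that the logarithm is defined at the nulls. The function name 
beampattern_db and the grid size are hypothetical. 

```python
import cmath
import math


def beampattern_db(weights, num_points=256):
    """Beampattern |C(u)|^2 in dB on a uniform grid -1/2 <= u < 1/2.

    Returns a list of (u, phi_deg, power_db) tuples for a ULA with
    half-wavelength spacing, so that phi = arcsin(2u) from (11.2.4).
    """
    num_sensors = len(weights)
    pattern = []
    for k in range(num_points):
        u = -0.5 + k / num_points
        # C(u) = c^H v(u), with v(u) the unit-norm response vector of (11.1.23).
        resp = sum(
            w.conjugate() * cmath.exp(-2j * math.pi * m * u)
            for m, w in enumerate(weights)
        ) / math.sqrt(num_sensors)
        power = max(abs(resp) ** 2, 1e-12)  # clip so that log10 stays finite
        phi_deg = math.degrees(math.asin(2 * u))
        pattern.append((u, phi_deg, 10 * math.log10(power)))
    return pattern


# Uniformly weighted 16-element array, c_m = 1/sqrt(M), steered to broadside.
M = 16
uniform_weights = [complex(1 / math.sqrt(M))] * M
pattern = beampattern_db(uniform_weights)
peak = max(pattern, key=lambda row: row[2])
print(peak)
```

The peak of the pattern falls at $u = 0$ (broadside), and the nulls occur at multiples of 
$1/M$ in normalized spatial frequency, as expected for uniform weighting. 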


Important note. The beampattern is the spatial frequency response of a given beam- 
former. It should not be confused with the steered response, which is the response of the 
array to a certain set of spatial signals impinging on the array as we steer the array to all 
possible angles. Since this operation corresponds to measuring the power as a function of 
spatial frequency or angle, the steered response might be better defined as the spatial power 
spectrum 


$$R(\phi) = E\{|\mathbf{c}^H(\phi)\, \mathbf{x}(n)|^2\} \tag{11.2.5}$$


where the choice of the beamformer $\mathbf{c}(\phi)$ determines the type of spatial spectrum, say, 
conventional or minimum-variance. Various spectrum estimation techniques were discussed 
in Chapters 5 and 9, several of which can be extended for the estimation of the spatial 
spectrum from measurements in practical applications. One interpretation of the estimation 
of the spectrum was made as a bank of frequency-selective filters at the frequencies at 
which the spectrum is computed. Similarly, the computation of the spatial spectrum can be 
thought of as the output of a bank of beamformers steered to the angles at which the spatial 
spectrum is computed. 
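
The distinction between the beampattern and the steered response can be seen in a small 
simulation. The Python sketch below estimates (11.2.5) by replacing the expectation with an 
average over array snapshots and by using the conventional beamformer $\mathbf{c}(\phi) = 
\mathbf{v}(\phi)$ at each scan angle. The snapshot model, the noise level, the random seed, and 
all function names are hypothetical choices made for this illustration. 

```python
import cmath
import math
import random


def ula_response(num_sensors, phi_deg, d_over_lambda=0.5):
    """Unit-norm ULA array response vector, following (11.1.23)."""
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    scale = 1 / math.sqrt(num_sensors)
    return [scale * cmath.exp(-2j * math.pi * m * u) for m in range(num_sensors)]


def steered_response(snapshots, angles_deg):
    """Sample estimate of R(phi) = E{|c^H(phi) x(n)|^2} with c(phi) = v(phi)."""
    num_sensors = len(snapshots[0])
    spectrum = []
    for phi in angles_deg:
        c = ula_response(num_sensors, phi)
        total = 0.0
        for x in snapshots:
            y = sum(cm.conjugate() * xm for cm, xm in zip(c, x))
            total += abs(y) ** 2
        spectrum.append(total / len(snapshots))
    return spectrum


# Simulated data: one unit-power, random-phase source at 20 degrees plus weak noise.
random.seed(0)
M, phi_s, sigma = 8, 20.0, 0.05  # sigma is the std of each real/imaginary noise part
v_s = ula_response(M, phi_s)
snapshots = []
for _ in range(100):
    s = cmath.exp(2j * math.pi * random.random())
    snapshots.append([
        math.sqrt(M) * vm * s + complex(random.gauss(0, sigma), random.gauss(0, sigma))
        for vm in v_s
    ])

angles = list(range(-90, 91))
spectrum = steered_response(snapshots, angles)
peak_angle = angles[spectrum.index(max(spectrum))]
print(peak_angle, max(spectrum))
```

Here the data are fixed and the beamformer is scanned over angle, whereas a beampattern 
fixes the beamformer and scans the direction of a hypothetical signal. 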


Output signal-to-noise ratio 


We now look at the signal-to-noise ratio (SNR) of the beamformer output and determine 
the improvement in SNR with respect to each element, known as the beamforming gain. Let 
us consider the signal model for a ULA from (11.1.15), which consists of a signal of interest 
arriving from an angle $\phi_s$ and thermal sensor noise $\mathbf{w}(n)$. The beamformer or spatial filter 
$\mathbf{c}$ is applied to the array signal $\mathbf{x}(n)$ as 

$$y(n) = \mathbf{c}^H \mathbf{x}(n) = \sqrt{M}\, \mathbf{c}^H \mathbf{v}(\phi_s)\, s(n) + \bar{w}(n) \tag{11.2.6}$$

where $\bar{w}(n) = \mathbf{c}^H \mathbf{w}(n)$ is the noise at the beamformer output and is also temporally 
uncorrelated. The beamformer output power is 

$$P_y = E\{|y(n)|^2\} = \mathbf{c}^H \mathbf{R}_x \mathbf{c} \tag{11.2.7}$$

where 

$$\mathbf{R}_x = E\{\mathbf{x}(n)\, \mathbf{x}^H(n)\} \tag{11.2.8}$$

is the correlation matrix of the array signal $\mathbf{x}(n)$. Recall from (11.1.15) and (11.1.23) that 
the signal for the $m$th element is given by 

$$x_m(n) = e^{-j2\pi (m-1) u_s}\, s(n) + w_m(n) \tag{11.2.9}$$

where $u_s$ is the normalized spatial frequency of the array signal produced by $s(n)$. The 
signal $s(n)$ is the signal of interest within a single sensor including the sensor response 
$H(F_c)$ from (11.1.14). Therefore, the signal-to-noise ratio in each element is given by 

$$\mathrm{SNR}_{\mathrm{elem}} \triangleq \frac{\sigma_s^2}{\sigma_w^2} = \frac{|e^{-j2\pi (m-1) u_s}\, s(n)|^2}{E\{|w_m(n)|^2\}} \tag{11.2.10}$$

where $\sigma_s^2 = E\{|s(n)|^2\}$ and $\sigma_w^2 = E\{|w_m(n)|^2\}$ are the element level signal and noise 
powers, respectively. Recall that the signal $s(n)$ has a deterministic amplitude and random 
phase. We assume that all the elements have equal noise power $\sigma_w^2$ so that the SNR does 
not vary from element to element. This $\mathrm{SNR}_{\mathrm{elem}}$ is commonly referred to as the element 
level SNR or the SNR per element. 

Now if we consider the signals at the output of the beamformer, the signal and noise 
powers are given by 


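$$P_s = E\{|\sqrt{M}\, [\mathbf{c}^H \mathbf{v}(\phi_s)]\, s(n)|^2\} = M \sigma_s^2\, |\mathbf{c}^H \mathbf{v}(\phi_s)|^2 \tag{11.2.11}$$

$$P_n = E\{|\mathbf{c}^H \mathbf{w}(n)|^2\} = \mathbf{c}^H \mathbf{R}_n \mathbf{c} = \|\mathbf{c}\|^2 \sigma_w^2 \tag{11.2.12}$$

because $\mathbf{R}_n = \sigma_w^2 \mathbf{I}$. Therefore, the resulting SNR at the beamformer output, known as the 
array SNR, is 

$$\mathrm{SNR}_{\mathrm{array}} = \frac{P_s}{P_n} = \frac{M\, |\mathbf{c}^H \mathbf{v}(\phi_s)|^2\, \sigma_s^2}{\|\mathbf{c}\|^2\, \sigma_w^2} = \frac{|\mathbf{c}^H \mathbf{v}(\phi_s)|^2}{\|\mathbf{c}\|^2}\, M\, \mathrm{SNR}_{\mathrm{elem}} \tag{11.2.13}$$

which is simply the product of the beamforming gain and the element level SNR. Thus, the 
beamforming gain is given by 

$$G_{\mathrm{bf}} \triangleq \frac{\mathrm{SNR}_{\mathrm{array}}}{\mathrm{SNR}_{\mathrm{elem}}} = \frac{|\mathbf{c}^H \mathbf{v}(\phi_s)|^2}{\|\mathbf{c}\|^2}\, M \tag{11.2.14}$$

The beamforming gain is strictly a function of the angle of arrival $\phi_s$ of the desired signal, 
the beamforming weight vector $\mathbf{c}$, and the number of sensors $M$. 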
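
The beamforming gain in (11.2.14) depends only on $\mathbf{c}$, $\phi_s$, and $M$, so it can be 
evaluated without simulating any data. The Python sketch below does so for a ULA with 
$\lambda/2$ spacing; the function names and the example angles are hypothetical. 

```python
import cmath
import math


def ula_response(num_sensors, phi_deg, d_over_lambda=0.5):
    """Unit-norm ULA array response vector, following (11.1.23)."""
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    scale = 1 / math.sqrt(num_sensors)
    return [scale * cmath.exp(-2j * math.pi * m * u) for m in range(num_sensors)]


def beamforming_gain(weights, phi_s_deg, d_over_lambda=0.5):
    """G_bf = M * |c^H v(phi_s)|^2 / ||c||^2, as in (11.2.14)."""
    num_sensors = len(weights)
    v = ula_response(num_sensors, phi_s_deg, d_over_lambda)
    inner = sum(cm.conjugate() * vm for cm, vm in zip(weights, v))
    norm_sq = sum(abs(cm) ** 2 for cm in weights)
    return num_sensors * abs(inner) ** 2 / norm_sq


# Uniform weights (steered to broadside) applied to signals from two directions.
M = 16
c = [complex(1 / math.sqrt(M))] * M
gain_on_axis = beamforming_gain(c, 0.0)   # signal arrives from the steered direction
gain_off_axis = beamforming_gain(c, 3.0)  # signal arrives 3 degrees off broadside
print(gain_on_axis, gain_off_axis)
```

The gain equals $M$ when the signal arrives from the direction to which the weights are 
matched and drops as the signal moves away from that direction. 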
11.2.1 Spatial Matched Filter 


Recall the array signal model of a single signal, arriving from a direction $\phi_s$, with sensor 
thermal noise 

$$\begin{aligned}
\mathbf{x}(n) &= \sqrt{M}\, \mathbf{v}(\phi_s)\, s(n) + \mathbf{w}(n) \\
&= \left[s(n) \;\; e^{-j2\pi u_s} s(n) \;\; \cdots \;\; e^{-j2\pi (M-1) u_s} s(n)\right]^T + \mathbf{w}(n)
\end{aligned} \tag{11.2.15}$$

where the components of the noise vector $\mathbf{w}(n)$ are uncorrelated and have power $\sigma_w^2$, that is, 
$E\{\mathbf{w}(n)\, \mathbf{w}^H(n)\} = \sigma_w^2 \mathbf{I}$. The individual elements of the array contain the same signal $s(n)$ 
with different phase shifts corresponding to the differences in propagation times between 
elements. Ideally, the signals from the $M$ array sensors are added coherently, which requires 
that each of the relative phases be zero at the point of summation; that is, we add $s(n)$ with 
a perfect replica of itself. Thus, we need a set of complex weights that results in a perfect 
phase alignment of all the sensor signals. The beamforming weight vector that phase-aligns 
a signal from direction $\phi_s$ at the different array elements is the steering vector, which is 
simply the array response vector in that direction, that is, 

$$\mathbf{c}_{\mathrm{mf}}(\phi_s) = \mathbf{v}(\phi_s) \tag{11.2.16}$$


The steering vector beamformer is also known as the spatial matched. filter’ since the steering 
vector is matched to the array response of signals impinging on the array from an angle 
g,. As aresult, @, is known as the look direction. The use of the spatial matched filter is 
commonly referred to as conventional beamforming. 

The output of the spatial matched filter is 


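$$\begin{aligned}
y(n) &= \mathbf{c}_{mf}^H(\phi_s)\,\mathbf{x}(n) = \mathbf{v}^H(\phi_s)\,\mathbf{x}(n) \\
&= \frac{1}{\sqrt{M}}\,[\,1\;\; e^{j2\pi u_s}\;\; \cdots\;\; e^{j2\pi(M-1)u_s}\,]
\begin{bmatrix} s(n) \\ e^{-j2\pi u_s}s(n) \\ \vdots \\ e^{-j2\pi(M-1)u_s}s(n) \end{bmatrix} + \bar{w}(n) \\
&= \frac{1}{\sqrt{M}}\,[\,s(n) + s(n) + \cdots + s(n)\,] + \bar{w}(n) \\
&= \sqrt{M}\,s(n) + \bar{w}(n)
\end{aligned} \tag{11.2.17}$$

where again $\bar{w}(n) = \mathbf{c}_{mf}^H(\phi_s)\,\mathbf{w}(n)$ is the beamformer output noise. Examining the array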
SNR of the spatial matched filter output, we obtain 


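$$\text{SNR}_{\text{array}} = \frac{P_s}{P_n} = \frac{M\sigma_s^2}{E\{|\mathbf{v}^H(\phi_s)\,\mathbf{w}(n)|^2\}} = \frac{M\sigma_s^2}{\mathbf{v}^H(\phi_s)\,\mathbf{R}_n\,\mathbf{v}(\phi_s)} = M\,\frac{\sigma_s^2}{\sigma_w^2} = M\,\text{SNR}_{\text{elem}} \tag{11.2.18}$$

since $P_s = M\sigma_s^2$ and $\mathbf{R}_n = \sigma_w^2\mathbf{I}$. Therefore, the beamforming gain is

$$G_{bf} = M \tag{11.2.19}$$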


that is, equal to the number of sensors. In the case of spatially white noise, the spatial matched
filter is optimum in the sense of maximizing the SNR at the output of the beamformer. Thus,
the beamforming gain of the spatial matched filter is known as the array gain because it is
the maximum possible gain of a signal with respect to sensor thermal noise for a given array.
Clearly from this perspective, the more elements in the array, the greater the beamforming
gain. However, physical reality places limitations on the number of elements that can be
used. The spatial matched filter maximizes the SNR because the individual sensor signals
are coherently aligned prior to their combination. However, as we will see, other sources of
interference that have spatial correlation require other types of adaptive beamformers that
maximize the signal-to-interference-plus-noise ratio (SINR).

†The spatial matched filter should not be confused with the optimum matched filters discussed in Section 6.9 that
depend on the correlation of the data. However, it is optimum in the case of spatially uncorrelated noise.

The beampattern of the spatial matched filter can serve to illustrate several key performance
metrics of an array. A sample beampattern of a spatial matched filter was shown
in Figure 11.9 for $\phi_s = 0°$. The first and most obvious attribute is the large lobe centered
on $\phi_s$, known as the mainlobe or mainbeam, and the remaining, smaller peaks are
known as sidelobes. The value of the beampattern at the desired angle $\phi = \phi_s$ is equal to
1 (0 dB) due to the normalization used in the computation of the beampattern. A response
of less than 1 in the look direction corresponds to a direct loss in desired signal power at
the beamformer output. The sidelobe levels determine the rejection of the beamformer to
signals not arriving from the look direction. The second attribute is the beamwidth, which
is the angular span of the mainbeam. The resolution of the beamformer is determined by
this mainlobe width, with smaller beamwidths resulting in better angular resolution. The
beamwidth is commonly measured from the half-power (-3-dB) points $\Delta\phi_{3\,\text{dB}}$ or from
null to null of the mainlobe $\Delta\phi_{nn}$. Using the beampattern, we next set out to examine the
effects of the number of elements and their spacing on the array performance in the context 
of the spatial matched filter. However, in the following example, we first illustrate the use 
of a spatial matched filter to extract a signal from noise. 


EXAMPLE 11.2.1. A signal received by a ULA with $M = 20$ elements and $\lambda/2$ spacing contains
both a signal of interest at $\phi_s = 20°$ with an array SNR of 20 dB and thermal sensor noise with
unit power ($\sigma_w^2 = 1$). The signal of interest is an impulse present only in the 100th sample and
is produced by the sequence of MATLAB commands


% M, N, d, lambda, phi_s (in degrees), and SNR (in dB) are assumed to be defined already
u_s=(d/lambda)*sin(phi_s*pi/180); s=zeros(M,N);
s(:,100)=(10^(SNR/20))*exp(-j*2*pi*u_s*[0:(M-1)]')/sqrt(M);


The uncorrelated noise samples with a Gaussian distribution are generated by
w=(randn(M,N)+j*randn(M,N))/sqrt(2);


The two signals are added to produce the overall array signal x = s + w. Examining the signal
at a single sensor in Figure 11.10(a), we see that the signal is not visible at n = 100 since the
element-level SNR is only 7 dB (the array SNR minus $10\log_{10} M \approx 13$ dB). The output power of this
sample for a given realization can be more or less than the expected SNR due to the addition of
the noise. However, when we apply a spatial matched filter using


c_mf=exp(-j*2*pi*u_s*[0:(M-1)]')/sqrt(M);
y=c_mf'*x;


we can clearly see the signal of interest since the array SNR is 20 dB. As a rule of thumb, we
require the array SNR to be at least 10 to 12 dB to clearly observe the signal.
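For readers without MATLAB, the experiment of Example 11.2.1 can be sketched in plain Python as follows. This is only an illustrative port: the number of snapshots `N = 200`, the random seed, and all variable names are assumptions that the example does not specify.

```python
import cmath
import math
import random

random.seed(0)                     # assumed seed, only for repeatability
M, N = 20, 200                     # N = 200 snapshots is an assumption
d_over_lambda = 0.5
phi_s_deg = 20.0
snr_db = 20.0                      # array SNR in dB

u_s = d_over_lambda * math.sin(math.radians(phi_s_deg))
v_s = [cmath.exp(-2j * math.pi * u_s * m) / math.sqrt(M) for m in range(M)]


def complex_noise():
    """One sample of unit-power circular complex Gaussian noise."""
    return complex(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) / math.sqrt(2.0)


# x[m][n]: sensor m, time sample n; thermal noise everywhere
x = [[complex_noise() for _ in range(N)] for _ in range(M)]

# impulse signal in the 100th sample (index 99 with zero-based indexing)
n0 = 99
amplitude = 10.0 ** (snr_db / 20.0)
for m in range(M):
    x[m][n0] += amplitude * v_s[m]

# spatial matched filter output y(n) = v^H(phi_s) x(n)
y = [sum(v_s[m].conjugate() * x[m][n] for m in range(M)) for n in range(N)]

element_snr_db = snr_db - 10.0 * math.log10(M)
peak_index = max(range(N), key=lambda n: abs(y[n]))
print(round(element_snr_db, 2), peak_index)
```

The element-level SNR printed here is the array SNR reduced by $10\log_{10} M$, and the largest output magnitude should fall on the sample that carries the impulse.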


Element spacing 


In Section 11.1.4, we determined that the element spacing must be $d \le \lambda/2$ to prevent
spatial aliasing. Here, we relax this restriction and look at various element spacings and the
resulting array characteristics, namely, their beampatterns. In Figure 11.11, we show the
beampatterns of spatial matched filters with $\phi_s = 0°$ for ULAs with element spacings of $\lambda/4$,
$\lambda/2$, $\lambda$, and $2\lambda$ (equal-sized apertures of $10\lambda$ with 40, 20, 10, and 5 elements, respectively).
FIGURE 11.10
The spatial signals from (a) an individual sensor and (b) the output of a spatial matched filter
beamformer.


We note that the beampatterns for $\lambda/4$ and $\lambda/2$ spacing are identical with equal-sized
mainlobes and the first sidelobe having a height of -13 dB. The oversampling for the array
with an element spacing of $\lambda/4$ provides no additional information and therefore does not
improve the beamformer response in terms of resolution. In the case of the undersampled
arrays ($d = \lambda$ and $2\lambda$), we see the same structure (beamwidth) around the look direction
but also note the additional peaks in the beampattern (0 dB) at $\pm 90°$ for $d = \lambda$ and even
closer in for $d = 2\lambda$. These additional lobes in the beampattern are known as grating lobes.
Grating lobes create spatial ambiguities; that is, signals incident on the array from the angle
associated with a grating lobe will look just like signals from the direction of interest. The
beamformer has no means of distinguishing signals from these various directions. In certain
applications, grating lobes may be acceptable if it is determined that it is either impossible
or very improbable to receive returns from these angles; for example, a communications
satellite is unlikely to receive signals at angles other than those corresponding to the ground
below. The benefit of the larger element spacing is that, for a given number of elements, the
resulting array has a larger aperture and thus better resolution, which is our next topic of
discussion. An array whose aperture is enlarged by spacing its elements more than $\lambda/2$ apart
is commonly referred to as a thinned array and is addressed in Problem 11.5.
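The location of the grating lobes follows from the periodicity of the ULA response in the spatial frequency $u = (d/\lambda)\sin\phi$: the spatial matched filter returns to 0 dB wherever $(d/\lambda)(\sin\phi - \sin\phi_s)$ is a nonzero integer. The short Python sketch below lists these angles. The function name and its arguments are illustrative assumptions.

```python
import math


def grating_lobe_angles(d_over_lambda, phi_s_deg=0.0):
    """Angles in degrees, other than the look direction, where the spatial
    matched filter beampattern of a ULA returns to 0 dB."""
    s0 = math.sin(math.radians(phi_s_deg))
    k_max = int(math.floor(2.0 * d_over_lambda)) + 1
    angles = []
    for k in range(-k_max, k_max + 1):
        if k == 0:
            continue                     # k = 0 is the mainlobe itself
        s = s0 + k / d_over_lambda       # sin(phi) of the k-th replica
        if -1.0 - 1e-12 <= s <= 1.0 + 1e-12:
            s = max(-1.0, min(1.0, s))
            angles.append(round(math.degrees(math.asin(s)), 6))
    return sorted(angles)


for spacing in (0.25, 0.5, 1.0, 2.0):
    print(spacing, grating_lobe_angles(spacing))
```

For a broadside look direction this gives no grating lobes for $d \le \lambda/2$, lobes at $\pm 90°$ for $d = \lambda$, and lobes at $\pm 30°$ and $\pm 90°$ for $d = 2\lambda$.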


Array aperture and beamforming resolution 


FIGURE 11.11
Beampatterns of a spatial matched filter for different element spacings with an equal-sized
aperture $L = 10\lambda$.


The aperture is the finite area over which a sensor collects spatial energy. In the case
of a ULA, the aperture is the distance between the first and last elements. In general, the
designer of an array yearns for as much aperture as possible. The greater the aperture, the
finer the resolution of the array, which is its ability to distinguish between closely spaced
sources. As we will see in Section 11.7, improved resolution results in better angle estimation
capabilities. The angular resolution of a sensor array is measured in beamwidth $\Delta\phi$, which
is commonly defined as the angular extent between the nulls of the mainbeam $\Delta\phi_{nn}$ or the
half-power points of the mainbeam (-3 dB) $\Delta\phi_{3\,\text{dB}}$. As a general rule of thumb, the -3-dB
beamwidth for an array with an aperture length of $L$ is quoted in radians as

$$\Delta\phi_{3\,\text{dB}} \approx \frac{\lambda}{L} \tag{11.2.20}$$

although the actual -3-dB points of a spatial matched filter yield a resolution of $\Delta\phi_{3\,\text{dB}} =
0.89\,\lambda/L$ (the resolution of the conventional matched filter near broadside, $\phi = 0°$). The
approximation in (11.2.20) is intended for the full range of prospective beamformers, not
just spatial matched filters.† Since the resolution is dependent on the operating frequency
$F_c$, or equivalently on the wavelength, the aperture is often measured in wavelengths rather
than in absolute length in meters. At large operating frequencies, say, $F_c = 10$ GHz or
$\lambda = 3$ cm (X band in radar terminology), it is possible to populate a physical aperture of
fixed length with a large number of elements, as opposed to lower operating frequencies,
say, $F_c = 300$ MHz or $\lambda = 1$ m.


†Tapered beamformers, as discussed in Section 11.2.2, may considerably exceed this approximation, particularly
for large tapers.
We illustrate the effect of aperture on resolution, using a few representative beampatterns.
Figure 11.12 shows beampatterns for $M = 4$, 8, 16, and 32 with interelement spacing
fixed at $d = \lambda/2$ (nonaliasing condition). Therefore, the corresponding apertures in
wavelengths are $L = 2\lambda$, $4\lambda$, $8\lambda$, and $16\lambda$. Clearly, increasing the aperture yields better
resolution, with a factor-of-2 improvement for each of the successive twofold increases in aperture
length. The level of the first sidelobe is always about 13 dB below the mainlobe peak.
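The relation between aperture and beamwidth can be checked numerically. The Python sketch below locates the half-power point of the broadside spatial matched filter by bisection and expresses the full -3-dB width in units of $\lambda/L$. It assumes, consistently with the apertures quoted for Figures 11.11 and 11.12, that the aperture is taken as $L = Md$; the function names are illustrative.

```python
import math


def mf_beampattern(M, d_over_lambda, phi_rad):
    """|v^H(0) v(phi)|^2 for an M-element ULA steered to broadside."""
    u = d_over_lambda * math.sin(phi_rad)
    re = sum(math.cos(2.0 * math.pi * u * m) for m in range(M)) / M
    im = sum(math.sin(2.0 * math.pi * u * m) for m in range(M)) / M
    return re * re + im * im


def half_power_beamwidth(M, d_over_lambda=0.5):
    """Full -3 dB width in radians of the broadside mainlobe (bisection)."""
    lo = 0.0                                              # beampattern equals 1 here
    hi = math.asin(min(1.0, 1.0 / (M * d_over_lambda)))   # first null, M*u = 1
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if mf_beampattern(M, d_over_lambda, mid) > 0.5:
            lo = mid
        else:
            hi = mid
    return 2.0 * lo


for M in (4, 8, 16, 32):
    bw = half_power_beamwidth(M)
    aperture = M * 0.5                                    # L / lambda, assuming L = M d
    print(M, round(math.degrees(bw), 2), round(bw * aperture, 3))
```

For the larger arrays the last printed column approaches the factor 0.89 quoted above, and each doubling of $M$ roughly halves the beamwidth.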
FIGURE 11.12
Beampatterns of a spatial matched filter for different aperture sizes with a common element
spacing of $d = \lambda/2$.


11.2.2 Tapered Beamforming 


The spatial matched filter would be perfectly sufficient if the only signal present, aside from 
the sensor thermal noise, were the signal of interest. However, in many instances we must 
contend with other, undesired signals that hinder our ability to extract the signal of interest. 
These signals may also be spatially propagating at the same frequency as the operating 
frequency of the array. We refer to such signals as interference. These signals may be 
present due to hostile adversaries that are attempting to prevent us from receiving the signal 
of interest, for example, jammers in radar or communications; or they might be incidental 
signals that are present in our current operating environment, such as transmissions by 
other users in a communications system or radar clutter. In Sections 11.3, 11.5, and 11.6, 
we outline ways in which we can overcome these interferers by using adaptive methods. 


However, there are also nonadaptive alternatives that can be employed in certain cases, 
namely, the use of a taper with the spatial matched filter. 

Consider the ULA signal model from (11.2.15), but now including an interference
signal $\mathbf{i}(n)$ made up of $P$ interference sources

$$\mathbf{x}(n) = \mathbf{s}(n) + \mathbf{i}(n) + \mathbf{w}(n) = \sqrt{M}\,\mathbf{v}(\phi_s)\,s(n) + \sqrt{M}\sum_{p=1}^{P}\mathbf{v}(\phi_p)\,i_p(n) + \mathbf{w}(n) \tag{11.2.21}$$

where $\mathbf{v}(\phi_p)$ and $i_p(n)$ are the array response vector and actual signal due to the $p$th
interferer, respectively. If we have a ULA with $\lambda/2$ element spacing, the beampattern of the
spatial matched filter, as shown in Figure 11.13, may have sidelobes that are high enough to
pass these interferers through the beamformer with a high enough gain to prevent us from
observing the desired signal. For this array, if an interfering source were present at $\phi = 20°$
with a power of 40 dB, the power of the interference at the output of the spatial matched
filter would be 20 dB because the sidelobe level at $\phi = 20°$ is only -20 dB. Therefore, if
we were trying to receive a weaker signal from $\phi_s = 0°$, we would be unable to extract it
because of sidelobe leakage from this interferer.
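The sidelobe leakage figure quoted above can be reproduced with a few lines of Python. The sketch assumes the $M = 20$, $\lambda/2$-spaced ULA of Figure 11.13 and a matched filter steered to broadside; the helper names are illustrative.

```python
import cmath
import math


def steering_vector(M, phi_deg, d_over_lambda=0.5):
    """Unit-norm ULA array response vector v(phi)."""
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    return [cmath.exp(-2j * math.pi * u * m) / math.sqrt(M) for m in range(M)]


def beampattern_db(c, phi_deg):
    """10*log10 |c^H v(phi)|^2 for weight vector c."""
    v = steering_vector(len(c), phi_deg)
    response = sum(ci.conjugate() * vi for ci, vi in zip(c, v))
    return 10.0 * math.log10(abs(response) ** 2)


M = 20
c_mf = steering_vector(M, 0.0)            # spatial matched filter steered to 0 degrees
sidelobe_db = beampattern_db(c_mf, 20.0)  # response toward the interferer
jammer_in_db = 40.0
jammer_out_db = jammer_in_db + sidelobe_db
print(round(sidelobe_db, 2), round(jammer_out_db, 2))
```

The response toward 20° comes out slightly below -20 dB, so roughly 20 dB of interference remains at the beamformer output, in line with the approximate numbers used in the text.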

The spatial matched filter has weights all with a magnitude equal to $1/\sqrt{M}$. The look
direction is determined by a linear phase shift across the weights of the spatial matched
filter. However, the sidelobe levels can be further reduced by tapering the magnitudes of
the spatial matched filter. To this end, we employ a tapering vector $\mathbf{t}$ that is applied to the
spatial matched filter to realize a low sidelobe level beamformer

$$\mathbf{c}_{tbf}(\phi_s) = \mathbf{t} \odot \mathbf{c}_{mf}(\phi_s) \tag{11.2.22}$$

where $\odot$ represents the Hadamard product, which is the element-by-element multiplication
of the two vectors (Strang 1998). We refer to this beamformer as the tapered beamformer.

The determination of a taper can be thought of as the design of the desired beamformer
where $\mathbf{c}_{mf}$ simply determines the desired angle. The weight vector of the spatial matched
filter from (11.2.16) has unit norm; that is, $\mathbf{c}_{mf}^H\mathbf{c}_{mf} = 1$. Similarly, the tapered beamformer
$\mathbf{c}_{tbf}$ is normalized so that

$$\mathbf{c}_{tbf}^H(\phi_s)\,\mathbf{c}_{tbf}(\phi_s) = 1 \tag{11.2.23}$$

The choices for tapers, or windows, were outlined in Section 5.1 in the context of spectral
estimation. Here, we use Dolph-Chebyshev tapers simply for illustration purposes. This
taper produces a constant sidelobe level (equiripples in the stopband in spectral estimation),
which is often a desirable attribute of a beamformer. The best taper choice is driven by
the actual application. The beampatterns of the ULA are used again, but this time the
beampatterns of tapered beamformers are also shown in Figure 11.13. The sidelobe levels
of the tapers were chosen to be -50 and -70 dB.† The same 40-dB interferer would have
been reduced to -10 and -30 dB at the beamformer output, respectively.

†In practice the tapering sidelobe levels are limited by array element location errors due to uncertainty. This limit
is often at -30 dB but may be even higher. For illustration purposes we will ignore these limits in this chapter.

FIGURE 11.13
Beampatterns of beamformers with $M = 20$ with no taper (solid
line), -50-dB taper (dashed line), and -70-dB taper (dash-dot line).

However, the use of tapers does not come without a cost. The peak of the beampattern
is no longer at 0 dB. This loss in gain in the current look direction is commonly referred to
as a tapering loss and is simply the beampattern evaluated at $\phi_s$:

$$L_{\text{taper}} = |C_{tbf}(\phi_s)|^2 = |\mathbf{c}_{tbf}^H(\phi_s)\,\mathbf{v}(\phi_s)|^2 \tag{11.2.24}$$

Since the tapering vector was normalized as in (11.2.23), the tapering loss is in the range
$0 \le L_{\text{taper}} \le 1$ with $L_{\text{taper}} = 1$ corresponding to no loss (untapered spatial matched filter).
The tapering loss is the loss in SNR of the desired signal at the beamformer output that
cannot be recovered. More significantly, notice that the mainlobes of the beampatterns in
Figure 11.13 are much broader for the tapered beamformers. The consequence is a loss
in resolution that becomes more pronounced as the tapering is increased to achieve lower
sidelobe levels. This phenomenon was also treated within the context of spectral estimation
in Section 5.1. However, its interpretation for an array can better be understood by examining
plots of the magnitude of the taper vector $\mathbf{t}$, shown in Figure 11.14 for the -50- and -70-dB
Dolph-Chebyshev tapers. Note that the elements on the ends of the array are given less
weighting as the tapering level is increased. The tapered array in effect deemphasizes these
end elements while emphasizing the center elements. Therefore, the loss in resolution for
a tapered beamformer might be interpreted as a loss in the effective aperture of the array
imparted by the tapering vector.
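To make the trade-off between sidelobe level and tapering loss concrete, the Python sketch below builds a Dolph-Chebyshev taper and applies it as in (11.2.22) to (11.2.24). The taper is obtained by sampling a Chebyshev polynomial and taking an inverse discrete Fourier transform, which is one common construction and is only assumed here, since the text does not say how its tapers were computed. All function names are illustrative.

```python
import cmath
import math


def cheb_poly(n, x):
    """Chebyshev polynomial T_n(x) for any real x."""
    if x > 1.0:
        return math.cosh(n * math.acosh(x))
    if x < -1.0:
        return (-1) ** n * math.cosh(n * math.acosh(-x))
    return math.cos(n * math.acos(x))


def dolph_chebyshev(M, sidelobe_db):
    """Real symmetric taper with equal sidelobes sidelobe_db below the peak."""
    r = 10.0 ** (sidelobe_db / 20.0)               # mainlobe-to-sidelobe ratio
    beta = math.cosh(math.acosh(r) / (M - 1))
    taper = []
    for n in range(M):
        centered = n - (M - 1) / 2.0
        acc = 0.0
        for k in range(M):                         # inverse DFT of pattern samples
            sample = cheb_poly(M - 1, beta * math.cos(math.pi * k / M))
            acc += sample * math.cos(2.0 * math.pi * centered * k / M)
        taper.append(acc / M)
    peak = max(taper)
    return [t / peak for t in taper]


def steering_vector(M, phi_deg, d_over_lambda=0.5):
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    return [cmath.exp(-2j * math.pi * u * m) / math.sqrt(M) for m in range(M)]


def response_db(c, phi_deg):
    """20*log10 |c^H v(phi)|, floored to avoid log of zero at exact nulls."""
    v = steering_vector(len(c), phi_deg)
    mag = abs(sum(ci.conjugate() * vi for ci, vi in zip(c, v)))
    return 20.0 * math.log10(max(mag, 1e-15))


M = 20
t = dolph_chebyshev(M, 50.0)
c = [tn * vn for tn, vn in zip(t, steering_vector(M, 0.0))]   # t (Hadamard) c_mf
norm = math.sqrt(sum(abs(x) ** 2 for x in c))
c_tbf = [x / norm for x in c]                                 # c^H c = 1, as in (11.2.23)

taper_loss_db = response_db(c_tbf, 0.0)                       # (11.2.24) in dB
peak_sidelobe_rel_db = max(
    response_db(c_tbf, 13.0 + 0.25 * i) for i in range(309)   # 13 to 90 degrees
) - taper_loss_db
print(round(taper_loss_db, 2), round(peak_sidelobe_rel_db, 2))
```

Under these assumptions the peak sidelobe sits 50 dB below the mainlobe peak, while the mainlobe peak itself drops by somewhat more than 1 dB, which is the tapering loss discussed above.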
FIGURE 11.14
The magnitude levels of the tapered beamforming weights as a
function of element number for $M = 20$ with no taper (solid line),
-50-dB taper (dashed line), and -70-dB taper (dash-dot line).
EXAMPLE 11.2.2. We illustrate the use of tapers with the spatial matched filter for the extraction
of a radar signal in the presence of a jamming interference source using a ULA with $M = 20$
elements with $\lambda/2$ spacing. The desired radar signal is known as a target and is present for only
one sample in time. Here the target signal is at time sample (range gate) $n = 100$ and is at $\phi_s = 0°$
with an array SNR of 20 dB. The jammer transmits a high-power, uncorrelated waveform (white
noise). The angle of the jammer is $\phi_i = 20°$, and its strength is 40 dB. The additive, sensor
thermal noise has unit power (0 dB). We generate the jammer signal for $N = 200$ samples with
the MATLAB commands
v_i = exp(-j*pi*[0:M-1]'*sin(phi_i*pi/180))/sqrt(M);
i_x = (10^(40/20))*v_i*(randn(1,N)+j*randn(1,N))/sqrt(2);
Similarly, the unit power thermal noise signal is produced by
w = (randn(M,N)+j*randn(M,N))/sqrt(2);


Two beamformers (steered to $\phi = 0°$) are applied to the resulting array returns: a spatial matched
filter and a tapered beamformer with a -50-dB sidelobe level. The resulting beamformer output
signals are shown in Figure 11.15. The spatial matched filter is unable to reduce the jammer
sufficiently to observe the target signal at $n = 100$. However, the tapered beamformer is able
to attenuate the jammer signal below the thermal noise level and the target is easily extracted.
The target signal is approximately 18.5 dB, which reflects the 1.5-dB tapering loss of
(11.2.24).
FIGURE 11.15
The output signals of a spatial matched filter and a tapered
beamformer (-50-dB).


11.3 OPTIMUM ARRAY PROCESSING 


So far, we have only considered beamformers whose weights are determined independently 
of the data to be processed. If instead we base the actual beamforming weights on the array 
data themselves, then the result is an adaptive array and the operation is known as adaptive 
beamforming. Ideally, the beamforming weights are adapted in such a way as to optimize 
the spatial response of the resulting beamformer based on a certain criterion. To this end, the 
criterion is chosen to enhance the desired signal while rejecting other, unwanted signals. This
weight vector is similar to the optimum matched filter from Chapter 6. However, the manner 
in which it is implemented, namely, the methodology of how this equation is successfully 
applied to the array processing problem, is the topic of this and the next three sections. 

This section focuses on optimum array processing methods that make use of the a priori 
known statistics of the data to derive the beamforming weights. Implicit in the optimization
is the a priori knowledge of the true statistics of the array data. In Section 11.5, we discuss 
techniques for implementing these methods that estimate the unknown statistics from the 
data. We will use the general term adaptive to refer to beamformers that use an estimated 
correlation matrix computed from array snapshots, while reserving the term optimum for 
beamformers that optimize a certain criterion based on knowledge of the array data statistics. 
We begin by discussing the array signal model that contains interference in addition to the 
desired signal and noise. We then proceed to derive the optimum beamformer, where the 
optimality criterion is the maximization of the theoretical signal-to-interference-plus-noise 
ratio. In addition, we give an alternate implementation of the optimum beamformer: the 
generalized sidelobe canceler. This structure also gives an intuitive understanding of the 
optimum beamformer. Various issues associated with the optimum beamformer, namely, the 
effect of signal mismatch and bandwidth on the performance of an optimum beamformer, 
are discussed in Section 11.4. 

The signal of interest is seldom the only array signal aside from thermal noise present. 
The array must often contend with other, undesired signals that interfere with our ability 
to extract this signal of interest, as described in Section 11.2.2. Often the interference is 
so powerful that even a tapered beamformer is unable to sufficiently suppress it to extract 
the signals of interest. The determination of the presence of signals of interest is known as 
detection, while the inference of their parameters, for example, the angle of arrival $\phi_s$, is
referred to as estimation. The topic of detection is not explicitly treated here. Rather, we 
seek to maximize the visibility of the desired signal at the array output, that is, the ratio of 
the signal power to that of the interference plus noise, to facilitate the detection process. 
There are several textbooks devoted to the subject of detection theory (Scharf 1991; Poor 
1994; Kay 1998) to which the interested reader is referred. Parameter estimation methods 
to determine the angle of the desired signal are the topic of Section 11.7. 

Consider an array signal that consists of the desired signal $\mathbf{s}(n)$, an interference signal
$\mathbf{i}(n)$, along with sensor thermal noise $\mathbf{w}(n)$, that is,

$$\mathbf{x}(n) = \mathbf{s}(n) + \mathbf{i}(n) + \mathbf{w}(n) = \sqrt{M}\,\mathbf{v}(\phi_s)\,s(n) + \mathbf{i}(n) + \mathbf{w}(n) \tag{11.3.1}$$

where $s(n)$ is a signal with deterministic amplitude $\sigma_s$ and uniformly distributed random
phase. The interference-plus-noise component of the array signal is

$$\mathbf{x}_{i+n}(n) = \mathbf{i}(n) + \mathbf{w}(n) \tag{11.3.2}$$

which are both modeled as zero-mean stochastic processes. The interference has spatial
correlation according to the angles of the contributing interferers, while the thermal noise
is spatially uncorrelated. The interference component of the signal may consist of several
sources, as modeled in (11.2.21). The sensor thermal noise is assumed to be uncorrelated
with power $\sigma_w^2$. The assumption is made that all of these three components are mutually
uncorrelated. As a result, the array correlation matrix is

$$\mathbf{R}_x = E\{\mathbf{x}(n)\,\mathbf{x}^H(n)\} = M\sigma_s^2\,\mathbf{v}(\phi_s)\,\mathbf{v}^H(\phi_s) + \mathbf{R}_i + \mathbf{R}_n \tag{11.3.3}$$

where $\sigma_s^2$ is the power of the signal of interest and $\mathbf{R}_i$ and $\mathbf{R}_n$ are the interference and noise
correlation matrices, respectively. The interference-plus-noise correlation matrix is the sum
of these latter two matrices

$$\mathbf{R}_{i+n} = \mathbf{R}_i + \sigma_w^2\mathbf{I} \tag{11.3.4}$$

where $\mathbf{R}_n = \sigma_w^2\mathbf{I}$ since the sensor thermal noise is spatially uncorrelated.


11.3.1 Optimum Beamforming 


The ultimate goal of the prospective adaptive beamformer is to combine the sensor signals 
in such a way that the interference signal is reduced to the level of the thermal noise while 
the desired signal is preserved. Stated another way, we would like to maximize the ratio of 
the signal power to that of the interference plus noise, known as the signal-to-interference- 
plus-noise ratio (SINR). Maximizing the SINR is the optimal criterion for most detection 
and estimation problems. Simply stated, maximizing the SINR seeks to improve the visibility 
of the desired signal as much as possible in a background of interference. This criterion 
should not be confused with maximizing the SNR (spatial matched filter) in the absence of 
interference. 
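At the input of the array, that is, in each individual sensor, the SINR is given by

$$\text{SINR}_{\text{elem}} = \frac{\sigma_s^2}{\sigma_i^2 + \sigma_w^2} \tag{11.3.5}$$

where $\sigma_s^2$, $\sigma_i^2$, and $\sigma_w^2$ are the signal, interference, and thermal noise powers in each
individual element. The SINR at the beamformer output, following the application of the
beamforming weight vector $\mathbf{c}$, is

$$\text{SINR}_{\text{out}} = \frac{|\mathbf{c}^H\mathbf{s}(n)|^2}{E\{|\mathbf{c}^H\mathbf{x}_{i+n}(n)|^2\}} = \frac{M\sigma_s^2\,|\mathbf{c}^H\mathbf{v}(\phi_s)|^2}{\mathbf{c}^H\mathbf{R}_{i+n}\mathbf{c}} \tag{11.3.6}$$

We wish to maximize this array output SINR. First, note that the interference-plus-noise
correlation matrix can be factored as

$$\mathbf{R}_{i+n} = \mathbf{L}_{i+n}\mathbf{L}_{i+n}^H \tag{11.3.7}$$

where $\mathbf{L}_{i+n}$ is the Cholesky factor of the correlation matrix.† See Section 6.3 for details.
Thus, defining

$$\tilde{\mathbf{c}} = \mathbf{L}_{i+n}^H\mathbf{c} \qquad \tilde{\mathbf{v}}(\phi_s) = \mathbf{L}_{i+n}^{-1}\mathbf{v}(\phi_s) \tag{11.3.8}$$

we can rewrite (11.3.6) as

$$\text{SINR}_{\text{out}} = M\sigma_s^2\,\frac{|\tilde{\mathbf{c}}^H\tilde{\mathbf{v}}(\phi_s)|^2}{\tilde{\mathbf{c}}^H\tilde{\mathbf{c}}} \tag{11.3.9}$$

Using the Schwarz inequality

$$|\tilde{\mathbf{c}}^H\tilde{\mathbf{v}}(\phi_s)| \le \|\tilde{\mathbf{c}}\|\,\|\tilde{\mathbf{v}}(\phi_s)\| \tag{11.3.10}$$

and substituting (11.3.10) into (11.3.9), we find that

$$\text{SINR}_{\text{out}} \le M\sigma_s^2\,\frac{\|\tilde{\mathbf{c}}\|^2\,\|\tilde{\mathbf{v}}(\phi_s)\|^2}{\|\tilde{\mathbf{c}}\|^2} = M\sigma_s^2\,\|\tilde{\mathbf{v}}(\phi_s)\|^2 \tag{11.3.11}$$

Thus, the maximum SINR is found by attaining the upper bound in (11.3.11), which yields

$$\text{SINR}_o = M\sigma_s^2\,\tilde{\mathbf{v}}^H(\phi_s)\,\tilde{\mathbf{v}}(\phi_s) = M\sigma_s^2\,\mathbf{v}^H(\phi_s)\,\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi_s) \tag{11.3.12}$$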


We also see that the same maximum SINR is obtained if we set $\tilde{\mathbf{c}} = \alpha\tilde{\mathbf{v}}(\phi_s)$ where $\alpha$
is an arbitrary constant. In other words, the SINR is maximized when these two vectors
are parallel to each other and $\alpha$ can be chosen to satisfy other requirements. Therefore,
using (11.3.8), we can solve for the optimum weight vector (Bryn 1962; Capon et al. 1967;
Brennan and Reed 1973)

$$\mathbf{c}_o = \alpha\,\mathbf{L}_{i+n}^{-H}\,\tilde{\mathbf{v}}(\phi_s) = \alpha\,\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi_s) \tag{11.3.13}$$

where $\alpha$ is an arbitrary constant. Thus, the optimum beamforming weights are proportional
to $\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)$. The proportionality constant $\alpha$ in (11.3.13) can be set in a variety of ways.


†Note that any square root factorization $\mathbf{R}_{i+n} = \mathbf{R}_{i+n}^{1/2}\mathbf{R}_{i+n}^{H/2}$ of the correlation matrix can be chosen.
Table 11.1 gives various normalizations for the optimum beamformer. The normalization
we adopt throughout this chapter is to constrain the optimum beamformer to have unity
gain in the look direction, that is, $\mathbf{c}_o^H\mathbf{v}(\phi_s) = 1$. Therefore,

$$\mathbf{c}_o^H\mathbf{v}(\phi_s) = \alpha\,[\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)]^H\,\mathbf{v}(\phi_s) = 1 \tag{11.3.14}$$

and the resulting optimum beamformer is given by

$$\mathbf{c}_o = \frac{\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi_s)}{\mathbf{v}^H(\phi_s)\,\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi_s)} \tag{11.3.15}$$

In general, the normalization of the optimum beamformer is arbitrary and is dictated by the
use of the output, for example, the measurement of residual interference power or detection. In any
case, the SINR is maximized independently of the normalization. The most commonly used
normalizations are listed in Table 11.1.


TABLE 11.1
Optimum weight normalizations for unit gain in look direction, unit
gain on noise, and unit gain on interference-plus-noise constraints.

Constraint                                Mathematical formulation                     Optimum beamformer normalization
MVDR (unit gain in look direction)        $\mathbf{c}^H\mathbf{v}(\phi_s) = 1$         $\alpha = [\mathbf{v}^H(\phi_s)\,\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi_s)]^{-1}$
Unit noise gain                           $\mathbf{c}^H\mathbf{c} = 1$                 $\alpha = [\mathbf{v}^H(\phi_s)\,\mathbf{R}_{i+n}^{-2}\,\mathbf{v}(\phi_s)]^{-1/2}$
Unit gain on interference-plus-noise*     $\mathbf{c}^H\mathbf{R}_{i+n}\mathbf{c} = 1$  $\alpha = [\mathbf{v}^H(\phi_s)\,\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi_s)]^{-1/2}$

*This normalization is commonly referred to as the adaptive matched filter normalization
(Robey et al. 1992). Its use is primarily for detection purposes. Since the output level of the
interference-plus-noise has a set power of unity, a constant detection threshold can be used
for all angles.
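The computation of the optimum weights in (11.3.15) can be sketched in plain Python as follows. The scenario (a 20-element, $\lambda/2$-spaced ULA with three interferers and unit-power thermal noise) mirrors the kind of example used in this chapter, but the convention of building $\mathbf{R}_{i+n}$ from unit-norm response vectors scaled by the INR, the small Gaussian-elimination solver, and all names are assumptions made for illustration.

```python
import cmath
import math


def steering_vector(M, phi_deg, d_over_lambda=0.5):
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    return [cmath.exp(-2j * math.pi * u * m) / math.sqrt(M) for m in range(M)]


def inner(a, b):
    """a^H b for two complex vectors."""
    return sum(ai.conjugate() * bi for ai, bi in zip(a, b))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0j] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x


M = 20
interferers = [(20.0, 35.0), (-30.0, 70.0), (50.0, 50.0)]   # (angle in deg, INR in dB)

# R_{i+n} = sum_p INR_p v_p v_p^H + I   (unit thermal noise power assumed)
R = [[(1.0 if i == k else 0.0) + 0j for k in range(M)] for i in range(M)]
for angle, inr_db in interferers:
    v_p = steering_vector(M, angle)
    power = 10.0 ** (inr_db / 10.0)
    for i in range(M):
        for k in range(M):
            R[i][k] += power * v_p[i] * v_p[k].conjugate()

v_s = steering_vector(M, 0.0)
r_inv_v = solve(R, v_s)                       # R^{-1} v(phi_s)
quad = inner(v_s, r_inv_v).real               # v^H R^{-1} v
c_o = [x / quad for x in r_inv_v]             # MVDR normalization, (11.3.15)

look_gain = inner(c_o, v_s)
noise_out = inner(c_o, c_o).real
interference_out = sum(
    10.0 ** (inr_db / 10.0) * abs(inner(c_o, steering_vector(M, angle))) ** 2
    for angle, inr_db in interferers
)
sinr_loss_db = 10.0 * math.log10(quad)        # sigma_w^2 v^H R^{-1} v with sigma_w^2 = 1
print(abs(look_gain), noise_out, interference_out, round(sinr_loss_db, 3))
```

The other rows of Table 11.1 follow by changing only the scalar that divides `r_inv_v`; the direction of the weight vector, and hence the output SINR, is unaffected.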


Alternately, the optimum beamformer can be derived by solving the following constrained
optimization problem: Minimize the interference-plus-noise power at the beamformer
output

$$P_{i+n} = E\{|\mathbf{c}^H\mathbf{x}_{i+n}(n)|^2\} = \mathbf{c}^H\mathbf{R}_{i+n}\mathbf{c} \tag{11.3.16}$$

subject to a look-direction distortionless response constraint, that is,

$$\min_{\mathbf{c}} P_{i+n} \quad \text{subject to} \quad \mathbf{c}^H\mathbf{v}(\phi_s) = 1 \tag{11.3.17}$$

The solution of this constrained optimization problem is found by using Lagrange multipliers
(see Appendix B and Problem 11.7) and results in the same weight vector as (11.3.15). This
formulation has led to the commonly used term minimum-variance distortionless response
(MVDR) beamformer. For a discussion of minimum-variance beamforming, see Van Veen
(1992). The optimum beamformer passes signals impinging on the array from angle $\phi_s$ while
rejecting significant energy (interference) from all other angles. This beamformer can be
thought of as an optimum spatial matched filter since it provides maximum interference
rejection, while matching the response of signals impinging on the array from a direction
$\phi_s$. The optimal weights balance the rejection of interference with the thermal noise gain
so that the output thermal noise does not cause a reduction in the output SINR.
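The Lagrange-multiplier solution of (11.3.17) can be outlined in a few lines. This is the standard argument, sketched here under the assumption that $\mathbf{R}_{i+n}$ is Hermitian and positive definite; the multiplier symbol $\mu$ is our own notation and does not appear in the text.

```latex
% Lagrangian with a complex multiplier \mu for the constraint c^H v(\phi_s) = 1
J(\mathbf{c},\mu) = \mathbf{c}^H \mathbf{R}_{i+n}\,\mathbf{c}
  + \mu\,[\,1 - \mathbf{c}^H \mathbf{v}(\phi_s)\,]
  + \mu^{*}[\,1 - \mathbf{v}^H(\phi_s)\,\mathbf{c}\,]

% Setting the gradient with respect to c^* to zero
\nabla_{\mathbf{c}^{*}} J = \mathbf{R}_{i+n}\,\mathbf{c} - \mu\,\mathbf{v}(\phi_s) = \mathbf{0}
  \;\Longrightarrow\; \mathbf{c} = \mu\,\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)

% Enforcing the constraint fixes the (real) multiplier
\mathbf{c}^H \mathbf{v}(\phi_s) = \mu^{*}\,\mathbf{v}^H(\phi_s)\,\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s) = 1
  \;\Longrightarrow\; \mu = [\,\mathbf{v}^H(\phi_s)\,\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)\,]^{-1}

% Resulting weights and minimum output power
\mathbf{c}_o = \frac{\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)}
                    {\mathbf{v}^H(\phi_s)\,\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)},
\qquad
P_{i+n,\min} = \mathbf{c}_o^H \mathbf{R}_{i+n}\,\mathbf{c}_o
  = [\,\mathbf{v}^H(\phi_s)\,\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)\,]^{-1}
```

The multiplier plays the role of the constant $\alpha$ in (11.3.13), so the constrained minimization and the SINR maximization lead to the same weight vector.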

The optimum beamformer maximizes the SINR given by (11.3.12), which is independent
of the normalization. Another useful metric is a measure of the performance relative
to the interference-free case, that is, $\mathbf{x}(n) = \mathbf{s}(n) + \mathbf{w}(n)$. To gauge the performance of
the beamformer independently of the desired signal power, we simply normalize the SINR
by the hypothetical array output SNR had there been no interference present, which from
(11.2.18) is $\text{SNR}_0 = M\sigma_s^2/\sigma_w^2$. The resulting measure is known as the SINR loss, which
for the optimum beamformer, by substituting into (11.3.12), is

$$L_{\text{sinr}}(\phi_s) \triangleq \frac{\text{SINR}_{\text{out}}(\phi_s)}{\text{SNR}_0} = \sigma_w^2\,\mathbf{v}^H(\phi_s)\,\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi_s) \tag{11.3.18}$$

The SINR loss is always between 0 and 1, taking on the maximum value when the performance
is equal to the interference-free case. Typically, the SINR loss is computed across
all angles for a given interference scenario. In this sense, the SINR loss of the optimum
beamformer provides a measure of the residual interference remaining following optimum
processing and informs us of our loss in performance due to the presence of interference.
We also notice that (11.3.18) is the reciprocal of the minimum-variance power spectrum of
the interference plus noise. Minimum-variance power spectrum estimation was discussed
in Section 9.5.
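For a single interferer the SINR loss in (11.3.18) has a closed form, which makes it easy to evaluate without a matrix inverse. The Python sketch below assumes unit thermal noise power and an interference-plus-noise matrix of the form $\mathbf{R}_{i+n} = p\,\mathbf{v}(\phi_j)\mathbf{v}^H(\phi_j) + \mathbf{I}$ with a unit-norm response vector, and applies the matrix inversion lemma, so that $L_{\text{sinr}}(\phi_s) = 1 - \frac{p}{1+p}\,|\mathbf{v}^H(\phi_j)\mathbf{v}(\phi_s)|^2$. The jammer angle, its power, and the function names are illustrative assumptions.

```python
import cmath
import math


def steering_vector(M, phi_deg, d_over_lambda=0.5):
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    return [cmath.exp(-2j * math.pi * u * m) / math.sqrt(M) for m in range(M)]


def sinr_loss_single_jammer(phi_look_deg, phi_j_deg, jnr_db, M=20):
    """L_sinr = v^H R^{-1} v for R = p v_j v_j^H + I (matrix inversion lemma)."""
    p = 10.0 ** (jnr_db / 10.0)
    v = steering_vector(M, phi_look_deg)
    v_j = steering_vector(M, phi_j_deg)
    rho = sum(a.conjugate() * b for a, b in zip(v_j, v))   # v_j^H v
    return 1.0 - (p / (1.0 + p)) * abs(rho) ** 2


loss_at_jammer_db = 10.0 * math.log10(sinr_loss_single_jammer(20.0, 20.0, 70.0))
loss_at_broadside_db = 10.0 * math.log10(sinr_loss_single_jammer(0.0, 20.0, 70.0))
print(round(loss_at_jammer_db, 2), round(loss_at_broadside_db, 3))
```

When the look direction coincides with the jammer, the loss is essentially the negative of the jammer-to-noise ratio, whereas away from the jammer it stays within a small fraction of a decibel of 0 dB.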


EXAMPLE 11.3.1. To demonstrate the optimum beamformer, we consider a scenario in which there
are three interference sources and compare it to a conventional beamformer (spatial matched
filter). The array is a 20-element ULA with $\lambda/2$ element spacing. These interferers are at the
following angles with the corresponding interference-to-noise ratios (INRs) in decibels: $\phi = 20°$
and INR = 35 dB, $\phi = -30°$ and INR = 70 dB, and $\phi = 50°$ and INR = 50 dB. The optimum
beamformer is first computed using (11.3.15) for a look direction of $\phi_s = 0°$. The beampattern
of this optimum beamformer is computed by using (11.2.3) and is plotted in Figure 11.16(a).
Notice the nulls at the angles of the interference ($\phi = -30°$, $20°$, $50°$). These nulls are deep
enough that the interference at the beamformer output is below the sensor thermal noise level. The
conventional beamformer, however, cannot place nulls on the interferers since it is independent
of the data. We also perform optimum beamforming across all angles $-90° \le \phi < 90°$ and
compute the corresponding SINR loss due to the interference using (11.3.18). The SINR loss
is plotted in Figure 11.16(b). The notches at the interference angles are simply the negative of
the INR of the interferers, corresponding to significant losses in performance. However, these
performance losses are limited to these angles. The SINR loss at all other angles is almost at
its maximum value of 1 (0 dB). The SINR loss of the conventional beamformer is significantly
worse at all angles because of the strong interference that makes its way to the beamformer
output through its sidelobes.
FIGURE 11.16
(a) Beampattern (steered to $\phi_s = 0°$) and (b) SINR loss for $-90° \le \phi_s < 90°$ versus angle. Solid line is the optimum
beamformer, and dashed line is the conventional beamformer.


EXAMPLE 11.3.2. We revisit the problem from Example 11.2.2 with a jammer at $\phi_i = 20°$,
except that the jammer power is now 70 dB. Clearly, the -50-dB tapered beamformer is no longer
capable of sufficiently suppressing this jammer. Rather, we compute the optimum beamformer
using (11.3.15), where $\mathbf{R}_{i+n} = 10^7\,\mathbf{v}(\phi_i)\,\mathbf{v}^H(\phi_i) + \mathbf{I}$. First, we examine the beampattern of the
optimum beamformer steered to $\phi = 0°$ in Figure 11.17(a). Notice the null on the jammer at
$\phi = 20°$ with a depth of more than 150 dB. We also plot the SINR loss in Figure 11.17(b)
as we scan the look direction from $-90°$ to $90°$. Almost no SINR loss is experienced at angles
away from the jammer, while at the jammer angle $\phi = 20°$, the SINR loss corresponds to the
jammer power (70 dB). As a similar exercise to that in Example 11.2.2, we can produce a target
signal at $\phi = 0°$ and attempt to extract it, using both a spatial matched filter and an optimum
beamformer. The output signals are shown in Figure 11.17(c) and (d), respectively. The optimum
beamformer is able to successfully extract the signal whereas the output of the spatial matched
filter is dominated by interference. Notice that we do not suffer the taper loss on the target as we
did for the tapered beamformer due to the $\mathbf{c}_o^H\mathbf{v}(\phi_s) = 1$ constraint in the optimum beamformer.
(c) Output signal of spatial matched filter. (d) Output signal of optimum beamformer.

FIGURE 11.17
(a) Beampattern and (b) SINR loss of optimum beamformer (solid line) versus spatial matched filter
(dashed line), along with (c) the output signals of a spatial matched filter and (d) the optimum
beamformer.
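The numbers quoted in Example 11.3.2 can be checked with a short numerical sketch. The array size
`M = 20` and the half-wavelength spacing are assumptions made here for illustration (the text above
does not restate the array used in Example 11.2.2), and the helper names `steering`, `mvdr_weights`,
and `sinr_loss` are hypothetical. The sketch forms $\mathbf{R}_{i+n}$ for a 70-dB jammer at
$20^\circ$, computes the MVDR weights, and evaluates the look-direction gain, the null depth, and
the SINR loss.

```python
import numpy as np

def steering(phi_deg, M, d_over_lambda=0.5):
    """Unit-norm ULA steering vector v(phi), sign convention as in (11.1.19)."""
    m = np.arange(M)
    u = d_over_lambda * np.sin(np.deg2rad(phi_deg))  # spatial frequency
    return np.exp(-2j * np.pi * u * m) / np.sqrt(M)

def mvdr_weights(R, v):
    """Optimum (MVDR) weights c = R^{-1} v / (v^H R^{-1} v), cf. (11.3.15)."""
    Rinv_v = np.linalg.solve(R, v)
    return Rinv_v / (v.conj() @ Rinv_v)

M = 20            # assumed number of sensors (not stated in this section)
sigma_w2 = 1.0    # unit thermal noise power, as implied by R = 10^7 v v^H + I

v_jam = steering(20.0, M)
R_in = 1e7 * np.outer(v_jam, v_jam.conj()) + sigma_w2 * np.eye(M)

v_look = steering(0.0, M)
c_o = mvdr_weights(R_in, v_look)

look_gain = abs(c_o.conj() @ v_look)                  # unity-gain constraint
null_db = 20 * np.log10(abs(c_o.conj() @ v_jam))      # beampattern at the jammer

def sinr_loss(phi_deg):
    """L_sinr = sigma_w^2 v^H R^{-1} v for a beamformer steered to phi_deg."""
    v = steering(phi_deg, M)
    return float(np.real(v.conj() @ np.linalg.solve(R_in, v))) * sigma_w2

loss_at_jammer_db = 10 * np.log10(sinr_loss(20.0))
loss_away_db = 10 * np.log10(sinr_loss(-40.0))

print(f"look-direction gain : {look_gain:.6f}")
print(f"null at jammer      : {null_db:.1f} dB")
print(f"SINR loss at jammer : {loss_at_jammer_db:.1f} dB")
print(f"SINR loss at -40 deg: {loss_away_db:.3f} dB")
```

Under these assumptions the null is deeper than 150 dB and the SINR loss at the jammer angle equals
the jammer power, while the loss away from the jammer stays close to 0 dB.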


11.3.2 Eigenanalysis of the Optimum Beamformer 


In many cases, significant insight can be gained by considering the optimum beamformer in 
terms of the eigenvalues and eigenvectors of the interference-plus-noise correlation matrix 


$$\mathbf{R}_{i+n} = \sum_{m=1}^{M} \lambda_m \mathbf{q}_m \mathbf{q}_m^H \qquad (11.3.19)$$

where the eigenvalues have been ordered from largest to smallest, that is,
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_M$. If the rank of the interference is $P$, then
$\lambda_m = \sigma_w^2$ for $m > P$; that is, the remaining eigenvalues are equal to the thermal
noise power. The eigenvectors are orthonormal ($\mathbf{q}_m^H\mathbf{q}_k = 0$ for $k \ne m$,
$\mathbf{q}_m^H\mathbf{q}_m = 1$) and form a basis for the interference-plus-noise subspace
that can be split into interference and noise subspaces given by

$$\text{Interference subspace: } \{\mathbf{q}_m,\ 1 \le m \le P\} \qquad \text{Noise subspace: } \{\mathbf{q}_m,\ P < m \le M\} \qquad (11.3.20)$$

The inverse of $\mathbf{R}_{i+n}$ can also be written in terms of the eigenvalues and eigenvectors,
$\lambda_m$ and $\mathbf{q}_m$, of the correlation matrix $\mathbf{R}_{i+n}$, that is,

$$\mathbf{R}_{i+n}^{-1} = \sum_{m=1}^{M} \frac{1}{\lambda_m}\,\mathbf{q}_m\mathbf{q}_m^H \qquad (11.3.21)$$

We further assume that the rank of the interference is less than the total number of sensors,
that is, $P < M$. In this case, the smallest eigenvalues of $\mathbf{R}_{i+n}$ are noise eigenvalues
and are equal to the thermal noise power, $\lambda_m = \sigma_w^2$ for $m > P$. Substituting (11.3.21)
into the optimum beamformer weights in (11.3.15), we have


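$$
\begin{aligned}
\mathbf{c}_o &= \alpha\,\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)
  = \alpha \sum_{m=1}^{M} \frac{1}{\lambda_m}\,\mathbf{q}_m\mathbf{q}_m^H\mathbf{v}(\phi_s) \\
&= \alpha\left[\frac{1}{\sigma_w^2}\mathbf{v}(\phi_s) - \frac{1}{\sigma_w^2}\mathbf{v}(\phi_s)
  + \sum_{m=1}^{M}\frac{\mathbf{q}_m^H\mathbf{v}(\phi_s)}{\lambda_m}\,\mathbf{q}_m\right] \\
&= \frac{\alpha}{\sigma_w^2}\left\{\mathbf{v}(\phi_s)
  - \sum_{m=1}^{M}\frac{\lambda_m-\sigma_w^2}{\lambda_m}\left[\mathbf{q}_m^H\mathbf{v}(\phi_s)\right]\mathbf{q}_m\right\}
\end{aligned}
\qquad (11.3.22)
$$

where $\alpha = [\mathbf{v}^H(\phi_s)\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)]^{-1}$ and the last step uses
the orthonormal expansion $\mathbf{v}(\phi_s) = \sum_{m=1}^{M}\mathbf{q}_m[\mathbf{q}_m^H\mathbf{v}(\phi_s)]$.
The resulting beam response is

$$C_o(\phi) = \frac{\alpha}{\sigma_w^2}\left\{C_q(\phi)
  - \sum_{m=1}^{M}\frac{\lambda_m-\sigma_w^2}{\lambda_m}\left[\mathbf{q}_m^H\mathbf{v}(\phi_s)\right]^{*}Q_m(\phi)\right\} \qquad (11.3.23)$$

where

$$C_q(\phi) = \mathbf{v}^H(\phi_s)\mathbf{v}(\phi) = \mathbf{c}_{mf}^H(\phi_s)\mathbf{v}(\phi) = C_{mf}(\phi) \qquad (11.3.24)$$

is the response of the spatial matched filter $\mathbf{c}_{mf}(\phi_s) = \mathbf{v}(\phi_s)$ [see (11.2.16)]
and is known as the quiescent response of the optimum beamformer. In turn,

$$Q_m(\phi) = \mathbf{q}_m^H\mathbf{v}(\phi) \qquad (11.3.25)$$

is the beam response of the $m$th eigenvector, known as an eigenbeam. Thus, the response of
the optimum beamformer consists of weighted eigenbeams subtracted from the quiescent
response. The weights for the eigenbeams are determined by the corresponding eigenvalue,
the noise power, and the cross-product of the look-direction steering vector and the
respective eigenvector. Examining the term $(\lambda_m - \sigma_w^2)/\lambda_m$, we see clearly that
for strong interferers $\lambda_m \gg \sigma_w^2$ and $(\lambda_m - \sigma_w^2)/\lambda_m \approx 1$, so the
eigenbeam is subtracted from the quiescent response weighted by the conjugate of
$\mathbf{q}_m^H\mathbf{v}(\phi_s)$. This subtraction of properly weighted interference
eigenvectors places nulls in the directions of the interference sources. The term
$\mathbf{q}_m^H\mathbf{v}(\phi_s)$ in (11.3.23) scales the interference eigenbeam to the quiescent
response of the spatial matched filter in the direction of the corresponding interferer. Thus, the
null depth for an interferer of the beampattern $|C_o(\phi)|^2$ is determined by the response of the
eigenbeam relative to the quiescent response and by the strength of the interferer relative to the
noise level. For the noise eigenvalues, however, $\lambda_m = \sigma_w^2$ and
$(\lambda_m - \sigma_w^2)/\lambda_m = 0$. Therefore, the noise eigenvectors have no
effect on the optimum beamformer. Interestingly, for the case of noise only and thus all
noise eigenvalues, that is, no interference present, the optimum beamformer reverts to the
spatial matched filter $\mathbf{c}_o(\phi_s) = \mathbf{c}_{mf}(\phi_s) = \mathbf{v}(\phi_s)$, which is the
beamformer that maximizes the SNR.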
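The eigenbeam form (11.3.22) can be compared numerically with the direct form
$\alpha\,\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)$. The following is a minimal sketch under assumed
parameters: a 10-element half-wavelength ULA and two interferers whose angles and powers are chosen
here only for illustration. The names `steering`, `c_direct`, and `c_eig` are hypothetical.

```python
import numpy as np

def steering(phi_deg, M, d_over_lambda=0.5):
    """Unit-norm ULA steering vector v(phi)."""
    m = np.arange(M)
    u = d_over_lambda * np.sin(np.deg2rad(phi_deg))
    return np.exp(-2j * np.pi * u * m) / np.sqrt(M)

M = 10                 # assumed array size
sigma_w2 = 1.0         # thermal noise power
# Assumed interferers: (angle in degrees, array-level power M * sigma_p^2)
interferers = [(20.0, 1e3), (-35.0, 1e2)]

R_in = sigma_w2 * np.eye(M, dtype=complex)
for phi_deg, power in interferers:
    v = steering(phi_deg, M)
    R_in += power * np.outer(v, v.conj())

v_s = steering(0.0, M)

# Direct form: c_o = alpha * R^{-1} v(phi_s)
Rinv_v = np.linalg.solve(R_in, v_s)
alpha = 1.0 / np.real(v_s.conj() @ Rinv_v)
c_direct = alpha * Rinv_v

# Eigenbeam form (11.3.22): quiescent vector minus weighted eigenvectors
lam, Q = np.linalg.eigh(R_in)                  # eigenvalues in ascending order
eig_weights = (lam - sigma_w2) / lam           # ~0 for noise, ~1 for strong interference
projections = Q.conj().T @ v_s                 # q_m^H v(phi_s) for every m
c_eig = (alpha / sigma_w2) * (v_s - Q @ (eig_weights * projections))

# The M - P smallest eigenvalues are noise eigenvalues (P = 2 here)
noise_weights = eig_weights[: M - len(interferers)]

print("max |c_direct - c_eig| :", np.max(np.abs(c_direct - c_eig)))
print("eigenbeam weights      :", np.round(eig_weights, 4))
```

The printed eigenbeam weights make the point of the text concrete: only the interference eigenvectors
receive a weight close to 1, and the noise eigenvectors drop out.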


11.3.3 Interference Cancelation Performance 


The interference cancelation performance of the optimum beamformer can be determined
by examining the beam response at the angles of the interferers. The beam response at these
angles indicates the depth of the null that the optimum beamformer places on the interferer.
Using the MVDR optimum beamformer from (11.3.15), we see that the response in the
direction of an interferer $\phi_p$ of an optimum beamformer that is steered in direction $\phi_s$ is

$$C_o(\phi_p) = \mathbf{c}_o^H\mathbf{v}(\phi_p) = \alpha\,\mathbf{v}^H(\phi_s)\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_p) \qquad (11.3.26)$$

where $\phi_p$ is the angle of the $p$th interferer and
$\alpha = [\mathbf{v}^H(\phi_s)\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)]^{-1}$. Now we note
that $\mathbf{R}_{i+n}$ can be split into a component due to the $p$th interferer and the correlation
matrix of the remaining interference-plus-noise $\mathbf{Q}_{i+n}$

$$\mathbf{R}_{i+n} = \mathbf{Q}_{i+n} + M\sigma_p^2\,\mathbf{v}(\phi_p)\mathbf{v}^H(\phi_p) \qquad (11.3.27)$$

where $\sigma_p^2$ is the power of the $p$th interferer in a single element. Using the matrix inversion
lemma (Appendix A), we obtain


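$$\mathbf{R}_{i+n}^{-1} = \mathbf{Q}_{i+n}^{-1}
  - \frac{M\sigma_p^2\,\mathbf{Q}_{i+n}^{-1}\mathbf{v}(\phi_p)\mathbf{v}^H(\phi_p)\mathbf{Q}_{i+n}^{-1}}
         {1 + M\sigma_p^2\,\mathbf{v}^H(\phi_p)\mathbf{Q}_{i+n}^{-1}\mathbf{v}(\phi_p)} \qquad (11.3.28)$$

Substituting (11.3.28) into (11.3.26), we find the optimum beamformer response to be
(Richmond 1999)

$$
\begin{aligned}
C_o(\phi_p) &= \alpha\,\mathbf{v}^H(\phi_s)\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_p) \\
&= \alpha\,\mathbf{v}^H(\phi_s)\mathbf{Q}_{i+n}^{-1}\mathbf{v}(\phi_p)
  - \alpha\,\frac{M\sigma_p^2\,\mathbf{v}^H(\phi_s)\mathbf{Q}_{i+n}^{-1}\mathbf{v}(\phi_p)\,
      \mathbf{v}^H(\phi_p)\mathbf{Q}_{i+n}^{-1}\mathbf{v}(\phi_p)}
     {1 + M\sigma_p^2\,\mathbf{v}^H(\phi_p)\mathbf{Q}_{i+n}^{-1}\mathbf{v}(\phi_p)} \\
&= \underbrace{\frac{\mathbf{v}^H(\phi_s)\mathbf{Q}_{i+n}^{-1}\mathbf{v}(\phi_p)}
     {\mathbf{v}^H(\phi_s)\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)}}_{\text{term 1}}
   \cdot
   \underbrace{\frac{1}{1 + M\sigma_p^2\,\mathbf{v}^H(\phi_p)\mathbf{Q}_{i+n}^{-1}\mathbf{v}(\phi_p)}}_{\text{term 2}}
\end{aligned}
\qquad (11.3.29)
$$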

We notice that the optimum beamformer response is made up of the product of two terms.
The first term is the response at angle $\phi_p$ of an optimum beamformer steered in direction
$\phi_s$ formed in the absence of this interferer ($\sigma_p^2 = 0$), that is, the sidelobe level of the
optimum beamformer had this interference not been present. However, the power of the interferer
is often significantly greater than this sidelobe level, and the optimum beamformer
cancels the interferer by placing a null at the angle of the interferer. The second term
produces the null at the angle $\phi_p$. By examining this term, it is apparent that the depth of
the null is determined by the power of the interferer $M\sigma_p^2$. Clearly, the larger the power of
the interferer, the smaller this term becomes and the deeper the null of the optimum
beamformer is at $\phi_p$. The factor $\mathbf{v}^H(\phi_p)\mathbf{Q}_{i+n}^{-1}\mathbf{v}(\phi_p)$ is the
reciprocal of the amount of energy received from $\phi_p$ not including the interferer, an energy
whose lower bound is the thermal noise power (spatially white). Since the power response of the
beamformer is $|C_o(\phi_p)|^2$, the null depth is actually proportional to $M^2\sigma_p^4$, or twice
the power of the interferer at the array output, in decibels (Compton 1988).
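The two-term factorization in (11.3.29) and the "twice the interferer power in decibels" rule can be
checked numerically. The sketch below is illustrative only: the array size, the angles, and the
background interferer contained in $\mathbf{Q}_{i+n}$ are assumptions made here, and
`null_response` is a hypothetical helper.

```python
import numpy as np

def steering(phi_deg, M, d_over_lambda=0.5):
    """Unit-norm ULA steering vector v(phi)."""
    m = np.arange(M)
    u = d_over_lambda * np.sin(np.deg2rad(phi_deg))
    return np.exp(-2j * np.pi * u * m) / np.sqrt(M)

M = 10                       # assumed array size
sigma_w2 = 1.0
v_s = steering(0.0, M)       # look direction
v_p = steering(25.0, M)      # interferer under study (assumed angle)
v_other = steering(-40.0, M)

# Q: everything except the p-th interferer (noise plus one assumed weaker interferer)
Q = sigma_w2 * np.eye(M) + 1e2 * np.outer(v_other, v_other.conj())

def null_response(M_sigma_p2):
    """Return C_o(phi_p) computed directly and via term 1 * term 2 of (11.3.29)."""
    R = Q + M_sigma_p2 * np.outer(v_p, v_p.conj())          # (11.3.27)
    quad_s = np.real(v_s.conj() @ np.linalg.solve(R, v_s))  # v_s^H R^{-1} v_s
    direct = (v_s.conj() @ np.linalg.solve(R, v_p)) / quad_s
    Qinv_vp = np.linalg.solve(Q, v_p)
    term1 = (v_s.conj() @ Qinv_vp) / quad_s
    term2 = 1.0 / (1.0 + M_sigma_p2 * np.real(v_p.conj() @ Qinv_vp))
    return direct, term1 * term2

direct_40, factored_40 = null_response(1e4)   # 40-dB interferer
direct_50, factored_50 = null_response(1e5)   # 50-dB interferer

# A 10-dB increase in interferer power should deepen the null by about 20 dB
drop_db = 20 * np.log10(abs(direct_40) / abs(direct_50))

print("null at 40 dB:", 20 * np.log10(abs(direct_40)), "dB")
print("null at 50 dB:", 20 * np.log10(abs(direct_50)), "dB")
print("additional null depth:", drop_db, "dB")
```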


11.3.4 Tapered Optimum Beamforming 


In the derivation of the optimum beamformer, we used the vector $\mathbf{v}(\phi_s)$ that was matched
to the array response of a desired signal arriving from an angle $\phi_s$. The resulting beamformer
weight vector $\mathbf{c}_o$ has unity gain in this direction; that is,
$\mathbf{c}_o^H\mathbf{v}(\phi_s) = 1$, owing to the normalization of the weights. However, the
sidelobes of the beamformer are still at the same levels as the spatial matched filter (nonadaptive
beamformer) from (11.2.16), although with a different structure, as can be seen from a sample
beampattern of the optimum beamformer shown in Figure 11.18(a).

(a) Optimum beamformer (no taper). (b) Tapered optimum beamformer (−50-dB taper).

FIGURE 11.18
Beampatterns of an optimum beamformer (a) without and (b) with tapering (−50-dB
Dolph-Chebyshev taper) steered to $\phi = 0^\circ$.


The optimum beamformer uses the interference-plus-noise correlation matrix $\mathbf{R}_{i+n}$.
Now, although the beamformer weights must be estimated from intervals of the data that
contain only interference (no desired signal present), they are presumably applied to
segments that contain both interference and a desired signal. What happens when we are
searching an angular region for potential desired signals? A desired signal at an angle $\phi_1$
may be easily found by using an adaptive beamformer directed to this angle ($\phi_s = \phi_1$),
assuming the signal strength after beamforming is significantly larger than the sensor thermal
noise. However, we will also be searching other angles for potential desired signals. If we
are looking at one of these other angles, say, $\phi_2 \ne \phi_1$, we want to avoid concluding a signal
is present when it may actually be due to sidelobe leakage of the signal at $\phi_1$. This problem
is best illustrated by using the beampattern of an optimum beamformer with an interferer at
$\phi = 20^\circ$ in Figure 11.18(a). The optimum beamformer is steered to an angle $\phi_s = 0^\circ$. Let
us assume another signal is present at $\phi_1 = -20^\circ$ that was not part of the interference (not
accounted for in the interference correlation matrix). The gain of the optimum beamformer
at $\phi = -20^\circ$ is approximately −20 dB. If the strength of this signal is significantly greater
than 20 dB, the optimum beamformer steered to $\phi_s = 0^\circ$ will pass this sidelobe signal with
sufficient strength that we may erroneously conclude a signal is present at $\phi_s = 0^\circ$. This
problem is commonly referred to as a sidelobe target or desired signal problem.

The sidelobe signal problem described above can be cured, at least partially, by reducing
the sidelobe levels of the beamformer to levels that sufficiently reject these sidelobe
signals. As we described in Section 11.2.2, the application of a taper to a spatial matched
filter resulted in a low sidelobe beampattern. The same principle applies to the optimum
beamformer. We define a tapered array response vector at an angle $\phi_s$ as

$$\mathbf{v}_t(\phi_s) = \mathbf{c}_{tbf}(\phi_s) = \mathbf{t} \odot \mathbf{c}_{mf}(\phi_s) \qquad (11.3.30)$$




where $\mathbf{t}$ is the tapering vector and $\odot$ is the Hadamard or element-by-element product.
The tapering vector is normalized such that $\mathbf{v}_t^H(\phi_s)\mathbf{v}_t(\phi_s) = 1$ as in
(11.2.23). The resulting low sidelobe adaptive beamformer is given by substituting
$\mathbf{v}_t(\phi_s)$ for $\mathbf{v}(\phi_s)$ in (11.3.15)

$$\mathbf{c}_{to} = \frac{\mathbf{R}_{i+n}^{-1}\mathbf{v}_t(\phi_s)}
  {\mathbf{v}_t^H(\phi_s)\mathbf{R}_{i+n}^{-1}\mathbf{v}_t(\phi_s)} \qquad (11.3.31)$$

We again use the Dolph-Chebyshev taper for illustration purposes because this choice of
taper provides a uniform sidelobe level. Other choices include the window functions discussed
in Chapter 5 in the context of spectrum estimation. Consider the optimum beamformer with
an interferer at $\phi = 20^\circ$ from Figure 11.18(a) with a potential signal leaking through the
sidelobe at $\phi = -20^\circ$. If instead we use a tapered optimum beamformer from (11.3.31)
with a −50-dB sidelobe taper, a potential signal at $\phi = -20^\circ$ receives a −50-dB level of
attenuation. Figure 11.18(b) shows the beampattern of this tapered optimum beamformer.
The sidelobe levels are significantly reduced while the null on the interferer at $\phi = 20^\circ$ has
been maintained.

The adaptive beamformer given by (11.3.31) is no longer optimal in any sense [unless it
were somehow possible for our desired signal to be spatially matched to $\mathbf{v}_t(\phi_s)$]. However,
the resulting adaptive beamformer still provides rejection of unwanted interferers via spatial
nulling through the use of $\mathbf{R}_{i+n}^{-1}$ in (11.3.31). In addition, the low sidelobe levels of
the beampattern reject signals not contained in the interference that are present at angles other
than the angle of look $\phi_s$. The penalty to be paid for the robustness provided by these low
sidelobes is a small tapering loss in the direction of the look $\phi_s$ given by

$$L_{taper} = |\mathbf{c}_{to}^H\mathbf{v}(\phi_s)|^2
  = \frac{|\mathbf{v}_t^H(\phi_s)\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)|^2}
         {[\mathbf{v}_t^H(\phi_s)\mathbf{R}_{i+n}^{-1}\mathbf{v}_t(\phi_s)]^2} \qquad (11.3.32)$$

and a widening of the mainlobe beamwidth, as can be seen in the beampattern in Figure
11.18. This tapering loss indicates a mismatch between the true signal and the constraint in
the optimum beamformer.
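A tapered optimum beamformer along the lines of (11.3.31) can be sketched as follows. Note the
substitution: a Hamming window (`np.hamming`) is used here as a stand-in for the Dolph-Chebyshev
taper of the text, so the sidelobe level is not the −50 dB of Figure 11.18(b). The array size and
the 40-dB interferer power are likewise assumptions, and the helper names are hypothetical.

```python
import numpy as np

def steering(phi_deg, M, d_over_lambda=0.5):
    """Unit-norm ULA steering vector v(phi)."""
    m = np.arange(M)
    u = d_over_lambda * np.sin(np.deg2rad(phi_deg))
    return np.exp(-2j * np.pi * u * m) / np.sqrt(M)

def constrained_weights(R, v):
    """c = R^{-1} v / (v^H R^{-1} v); v may be tapered or untapered."""
    Rinv_v = np.linalg.solve(R, v)
    return Rinv_v / (v.conj() @ Rinv_v)

M = 16                                   # assumed array size
v_s = steering(0.0, M)                   # look direction
v_j = steering(20.0, M)                  # interferer direction
R_in = np.eye(M) + 1e4 * np.outer(v_j, v_j.conj())   # assumed 40-dB interferer

# Tapered response vector (11.3.30); Hamming stands in for Dolph-Chebyshev
taper = np.hamming(M)
v_t = taper * v_s
v_t = v_t / np.linalg.norm(v_t)          # normalize so that v_t^H v_t = 1

c_o = constrained_weights(R_in, v_s)     # optimum beamformer (11.3.15)
c_to = constrained_weights(R_in, v_t)    # tapered optimum beamformer (11.3.31)

# Tapering loss (11.3.32), evaluated two ways
taper_loss = abs(c_to.conj() @ v_s) ** 2
taper_loss_formula = (abs(v_t.conj() @ np.linalg.solve(R_in, v_s)) ** 2
                      / np.real(v_t.conj() @ np.linalg.solve(R_in, v_t)) ** 2)

# Response to an unmodeled sidelobe signal at -20 degrees, and the interferer null
v_side = steering(-20.0, M)
sidelobe_untapered_db = 20 * np.log10(abs(c_o.conj() @ v_side))
sidelobe_tapered_db = 20 * np.log10(abs(c_to.conj() @ v_side))
null_tapered_db = 20 * np.log10(abs(c_to.conj() @ v_j))

print("taper loss (dB)          :", 10 * np.log10(taper_loss))
print("sidelobe at -20, no taper:", sidelobe_untapered_db, "dB")
print("sidelobe at -20, tapered :", sidelobe_tapered_db, "dB")
print("null on interferer       :", null_tapered_db, "dB")
```

The tapered weights keep the null on the interferer while lowering the response at $-20^\circ$, at the
cost of a tapering loss in the look direction.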


11.3.5 The Generalized Sidelobe Canceler 


We have shown that the optimum MVDR beamformer maximizes the output SINR and can 
be formulated as a constrained optimization given by 


$$\min_{\mathbf{c}}\ \mathbf{c}^H\mathbf{R}_{i+n}\mathbf{c} \qquad \text{subject to} \qquad \mathbf{c}^H\mathbf{v}(\phi_s) = 1 \qquad (11.3.33)$$

which results in the MVDR beamformer weight vector

$$\mathbf{c}_o = \frac{\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)}
  {\mathbf{v}^H(\phi_s)\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)} \qquad (11.3.34)$$

This problem formulation can be broken up into constrained and unconstrained components
that give rise to both an alternate implementation and a more intuitive interpretation of the
optimum beamformer. The resulting structure, known as the generalized sidelobe canceler
(GSC) (Griffiths and Jim 1982), uses a preprocessing stage to transform the optimization
from constrained to unconstrained (Applebaum and Chapman 1976; Griffiths and Jim 1982).
The GSC structure is illustrated in Figure 11.19.

Consider the array signal $\mathbf{x}(n)$ from (11.3.1) consisting of a signal component
$\mathbf{s}(n)$ and an interference-plus-noise component $\mathbf{x}_{i+n}(n)$. We are interested in
forming the optimum beamformer steered to the angle $\phi_s$. Let us start by forming a nonadaptive
spatial matched filter in this direction, $\mathbf{c}_{mf} = \mathbf{v}(\phi_s)$. The resulting output is
the main channel signal given by

$$y_0(n) = \mathbf{c}_{mf}^H(\phi_s)\mathbf{x}(n) = \mathbf{v}^H(\phi_s)\mathbf{x}(n) = s_0(n) + i_0(n) + w_0(n) \qquad (11.3.35)$$


[Block diagram: the upper branch carries $y_0(n) = s_0(n) + i_0(n) + w_0(n)$, the output is $y(n)$,
and the legend marks $M$-dimensional signal paths.]

FIGURE 11.19
Generalized sidelobe canceler.


This nonadaptive beamformer makes up the upper branch of the GSC. In addition, let us
form a lower branch consisting of $M - 1$ channels in which the unconstrained optimization is
performed. To prevent signal cancelation according to the unity-gain constraint in (11.3.33),
we must ensure that these $M - 1$ channels do not contain any signals from $\phi_s$. To this end,
we form an $M \times (M - 1)$ signal blocking matrix $\mathbf{B}$ that is orthogonal to the
look-direction constraint $\mathbf{v}(\phi_s)$

$$\mathbf{B}^H\mathbf{v}(\phi_s) = \mathbf{0} \qquad (11.3.36)$$

The resulting output of the blocking matrix is the $(M - 1) \times 1$ vector signal

$$\mathbf{x}_B(n) = \mathbf{B}^H\mathbf{x}(n) \qquad (11.3.37)$$

Several choices for the blocking matrix exist that can perform this projection onto
the $(M - 1)$-dimensional subspace orthogonal to $\mathbf{v}(\phi_s)$. One choice uses a set of
$M - 1$ beams that are each chosen to satisfy this constraint. The spatial frequency of the ULA for
an angle $\phi_s$ is

$$u_s = \frac{d}{\lambda}\sin\phi_s \qquad (11.3.38)$$

For $\mathbf{v}(\phi_s)$, spatial matched filters at the frequencies

$$u_m = u_s + \frac{m}{M} \qquad (11.3.39)$$

for $m = 1, 2, \ldots, M - 1$ are mutually orthogonal as well as orthogonal to $\mathbf{v}(\phi_s)$,
that is,

$$\mathbf{v}^H(\phi_m)\mathbf{v}(\phi_s) = 0 \qquad (11.3.40)$$

where $\phi_m$ is the angle corresponding to the spatial frequency $u_m$ given by (11.3.39). Thus,
we can construct a beamspace signal blocking matrix from these $M - 1$ steering vectors

$$\mathbf{B} = [\mathbf{v}(u_1)\ \ \mathbf{v}(u_2)\ \cdots\ \mathbf{v}(u_{M-1})] \qquad (11.3.41)$$


An alternative signal blocking matrix can be implemented, assuming the array is presteered
to the angle $\phi_s$ (Griffiths and Jim 1982). Presteering is accomplished by phase-shifting each
element of the array by the corresponding steering vector element to this angle without
actually forming a summation. Then any blocking matrix for which the elements of each
column sum to zero will satisfy (11.3.36).

Once the nonadaptive preprocessing has been performed for the upper and lower 
branches of the GSC, an unconstrained optimization can be performed in the lower branch. 


†Although the optimum beamformer was formulated for a signal-free interference-plus-noise correlation
matrix $\mathbf{R}_{i+n}$, it is possible that the presence of the desired signal in the data is unavoidable.
Using the $M - 1$ channels in the lower branch, we want to estimate the undesired portion
of the upper branch signal $y_0(n)$ due to interference. This interference is presumed to arrive
at the array from a different angle than $\phi_s$ so that it must be contained in the lower branch
signal as well. Thus, we need to compute an adaptive weight vector for the lower branch
channels that forms an estimate of the interference in the upper branch. The estimated
interference is subtracted from the upper branch. This problem is the classical MMSE filtering
problem (see Chapter 6), whose solution is given by the Wiener-Hopf equation

$$\mathbf{c}_B = \mathbf{R}_B^{-1}\mathbf{r}_B \qquad (11.3.42)$$

where $\mathbf{R}_B = E\{\mathbf{x}_B(n)\mathbf{x}_B^H(n)\}$ is the lower branch correlation matrix and
$\mathbf{r}_B = E\{\mathbf{x}_B(n)\,y_0^*(n)\}$ is the cross-correlation vector between the upper and
lower branch signals. The resulting estimate of the upper branch interference signal is

$$\hat{i}_0(n) = \mathbf{c}_B^H\mathbf{x}_B(n) \qquad (11.3.43)$$

and the output of the GSC is

$$y(n) = y_0(n) - \hat{i}_0(n) = y_0(n) - \mathbf{c}_B^H\mathbf{x}_B(n) \qquad (11.3.44)$$
As we stated earlier, the GSC is equivalent to the optimum beamformer; that is, it maximizes
the SINR at its output for signals arriving at angle $\phi_s$. The power of the GSC formulation lies
in its interpretation. Whereas for the optimum beamformer the interference was canceled
by forming spatial nulls in the directions of interferers, the GSC can be visualized as
estimating the interference component in the upper branch from the lower-branch signals.
Of course, the GSC also forms spatial nulls in the directions of the interferers. In terms of
an alternate implementation, one must consider that if we are to steer the array to a number
of different angles, each direction will require the formation of a new blocking matrix and
the computation of a different correlation matrix and cross-correlation vector for the GSC.
On the other hand, the optimum beamformer formulation has the same correlation matrix
independent of the direction to which it is steered and, therefore, is often preferred for
implementation purposes.
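The equivalence between the GSC and the MVDR beamformer can be verified numerically. The following
is a minimal sketch using the beamspace blocking matrix of (11.3.41) and known (not estimated)
correlation matrices; the array size, look direction, and interferer scenario are assumptions made
here, and `steering_u` is a hypothetical helper parameterized by spatial frequency. Since
$y_0(n) = \mathbf{v}^H(\phi_s)\mathbf{x}(n)$ and $\mathbf{x}_B(n) = \mathbf{B}^H\mathbf{x}(n)$, the sketch
uses $\mathbf{R}_B = \mathbf{B}^H\mathbf{R}_{i+n}\mathbf{B}$ and
$\mathbf{r}_B = \mathbf{B}^H\mathbf{R}_{i+n}\mathbf{v}(\phi_s)$, and the overall GSC weight vector is
$\mathbf{v}(\phi_s) - \mathbf{B}\mathbf{c}_B$.

```python
import numpy as np

def steering_u(u, M):
    """Unit-norm ULA steering vector at spatial frequency u = (d/lambda) sin(phi)."""
    return np.exp(-2j * np.pi * u * np.arange(M)) / np.sqrt(M)

M = 8                                             # assumed array size
d_over_lambda = 0.5                               # assumed half-wavelength spacing
u_s = d_over_lambda * np.sin(np.deg2rad(10.0))    # assumed look direction: 10 degrees
v_s = steering_u(u_s, M)

# Assumed scenario: two interferers plus unit-power white noise
R_in = np.eye(M, dtype=complex)
for phi_deg, power in [(-30.0, 1e3), (45.0, 1e2)]:
    v = steering_u(d_over_lambda * np.sin(np.deg2rad(phi_deg)), M)
    R_in += power * np.outer(v, v.conj())

# Beamspace blocking matrix (11.3.41): beams at u_s + m/M, m = 1, ..., M-1
B = np.column_stack([steering_u(u_s + m / M, M) for m in range(1, M)])
blocking_residual = np.max(np.abs(B.conj().T @ v_s))   # should vanish, cf. (11.3.36)

# Lower-branch Wiener solution (11.3.42)
R_B = B.conj().T @ R_in @ B
r_B = B.conj().T @ R_in @ v_s
c_B = np.linalg.solve(R_B, r_B)

# Equivalent overall weights: y(n) = (v_s - B c_B)^H x(n), cf. (11.3.44)
c_gsc = v_s - B @ c_B

# Direct MVDR weights (11.3.34) for comparison
w = np.linalg.solve(R_in, v_s)
c_mvdr = w / (v_s.conj() @ w)
max_diff = np.max(np.abs(c_gsc - c_mvdr))

print("blocking residual |B^H v_s| :", blocking_residual)
print("max |c_gsc - c_mvdr|        :", max_diff)
```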


11.4 PERFORMANCE CONSIDERATIONS FOR OPTIMUM BEAMFORMERS 


In this section we look at some considerations that influence the performance of an optimum 
beamformer. These considerations are also applicable to the adaptive methods in Section 
11.5 that are derived from the optimum beamformer. Since the optimum beamformer serves 
as an upper bound on the performance of any adaptive method, these considerations can 
serve as adjustments to this performance bound for the adaptive counterparts to the optimum 
beamformer. 

Two major factors that affect the performance of an optimum beamformer are: 


• Mismatch of the actual signal to the assumed signal model used by the optimum beamformer
• Bandwidth of the signal that violates the narrowband assumption.


In the first section, we look at the effects of differences in the actual signal from that assumed 
for the optimum beamformer, known as signal mismatch. In virtually all array processing 
implementations, some level of mismatch will exist, due to either uncertainty in the exact 
angle of arrival of the signal of interest or the fact that the locations and characteristics of 
the individual sensors differ from our assumptions. As we will see, these errors that produce 
a signal mismatch can have profound implications on performance, particularly when the 
signal of interest is present in the correlation matrix. Next, we look at the effects of wider 
bandwidths on the performance of the optimum beamformer. In many applications, certain 
requirements necessitate the use of larger bandwidths. Their impact and possible means of 
correction are discussed in this section. 


11.4.1 Effect of Signal Mismatch 


In our formulation of the optimum beamformer, we assumed that a signal arriving at the
array from an angle $\phi_s$ would produce a response equal to the ideal steering vector for a
ULA [see (11.1.19)]. Thus, the optimum beamformer constrained its response to be spatially
"matched" to the array response of the signal $\mathbf{v}_s = \mathbf{v}(\phi_s) = \mathbf{v}_0$, where
$\phi_0 = \phi_s$,

$$\mathbf{c}_o^H\mathbf{v}_0 = \mathbf{c}_o^H\mathbf{v}_s = 1 \qquad (11.4.1)$$

that is, to pass it with unity gain. The vector $\mathbf{v}_0$ is the assumed array response. However,
in reality, the signal may exhibit a different response across the array or may arrive from
another angle $\phi_s \ne \phi_0$. The differences in response arise due to distortion of the waveform
during propagation, amplitude and phase mismatches between the individual sensors, or
errors in the assumed locations of the sensors.† These mismatches manifest themselves in a
deviation of the array response from that assumed for a ULA in (11.1.19). Alternatively, if the
angle of arrival of the signal differs from the assumed angle, the result is an array response
as in (11.1.19), but for the angle $\phi_s$ as opposed to the steering angle $\phi_0$. In either case, the
beamformer is mismatched with the signal of interest and is no longer optimum. In this
section, we examine the effect of these mismatches on the performance of the optimum
beamformer, both when the signal of interest is contained in the correlation matrix and when it is
absent from it. As we will see, the inclusion of this signal of interest in the correlation matrix
has profound implications on the performance of a mismatched optimum beamformer. The
analysis that follows was originally reported by Cox (1973).

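Consider the case of an array signal consisting of a signal of interest $\mathbf{s}(n)$, interference
$\mathbf{i}(n)$, and thermal noise $\mathbf{w}(n)$

$$\mathbf{x}(n) = \mathbf{s}(n) + \mathbf{i}(n) + \mathbf{w}(n) \qquad (11.4.2)$$

where the noise is assumed to be uncorrelated, that is, $\mathbf{R}_n = \sigma_w^2\mathbf{I}$. Now let us
assume that the signal of interest is given by

$$\mathbf{s}(n) = \sqrt{M}\,s(n)\,\mathbf{u}_s \qquad (11.4.3)$$

where $\mathbf{u}_s$, with unit norm ($\mathbf{u}_s^H\mathbf{u}_s = 1$), is the true array response to the
signal of interest. For generality, $\mathbf{u}_s$ may be either an ideal or a nonideal array response
for a ULA of a signal arriving from angle $\phi_s$, but in either case it is mismatched with the
assumed response

$$\mathbf{u}_s \ne \mathbf{v}_0 \qquad (11.4.4)$$

The correlation matrix of the signal $\mathbf{x}(n)$ is made up of components due to the signal and
the interference-plus-noise

$$\mathbf{R}_x = E\{\mathbf{x}(n)\mathbf{x}^H(n)\} = M\sigma_s^2\,\mathbf{u}_s\mathbf{u}_s^H + \mathbf{R}_{i+n} \qquad (11.4.5)$$

where the signal power is $\sigma_s^2 = E\{|s(n)|^2\}$. The optimum beamformer with an MVDR
constraint for the signal $\mathbf{s}(n)$ in (11.3.15) is

$$\mathbf{c}_o = \frac{\mathbf{R}_{i+n}^{-1}\mathbf{u}_s}{\mathbf{u}_s^H\mathbf{R}_{i+n}^{-1}\mathbf{u}_s} \qquad (11.4.6)$$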


However, the true array response $\mathbf{u}_s$ is unknown. This optimum beamformer in (11.4.6)
yields the maximum output SINR given by

$$\text{SINR}_o = M\sigma_s^2\,\mathbf{u}_s^H\mathbf{R}_{i+n}^{-1}\mathbf{u}_s = \text{SNR}_0 \cdot L_{sinr} \qquad (11.4.7)$$

where $\text{SNR}_0 = M\sigma_s^2/\sigma_w^2$ is the matched filter SNR in the absence of interference from
(11.2.18) (best performance possible) and
$L_{sinr} = \sigma_w^2\,\mathbf{u}_s^H\mathbf{R}_{i+n}^{-1}\mathbf{u}_s$ is the SINR loss from
(11.3.18) due to the presence of the interference. Thus, we can evaluate the losses due to
signal mismatch and the inclusion of the signal of interest in the correlation matrix with
respect to the maximum SINR in (11.4.7).

†Similar losses also result from using a tapered steering vector. This loss was shown for the tapered
optimum beamformer for the case of the signal of interest not present in the correlation matrix. As we
will show in this section, the inclusion of the signal of interest in the correlation matrix can cause
substantial losses in such a tapered beamformer.


Loss due to signal mismatch 


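First, let us consider a mismatched signal $\mathbf{v}_0 \ne \mathbf{u}_s$ without the signal of interest
present in the correlation matrix. The mismatch arises due to our lack of knowledge of the true array
response to the signal of interest $\mathbf{u}_s$. The beamformer weights, computed by assuming
the array response to the signal to be $\mathbf{v}_0$ with an MVDR normalization, are given by

$$\mathbf{c}_1 = \frac{\mathbf{R}_{i+n}^{-1}\mathbf{v}_0}{\mathbf{v}_0^H\mathbf{R}_{i+n}^{-1}\mathbf{v}_0} \qquad (11.4.8)$$

The SINR at the beamformer output for this weight vector is given by

$$
\begin{aligned}
\text{SINR}_1 &= \frac{|\mathbf{c}_1^H\mathbf{s}(n)|^2}{\mathbf{c}_1^H\mathbf{R}_{i+n}\mathbf{c}_1}
  = M\sigma_s^2\,\frac{|\mathbf{v}_0^H\mathbf{R}_{i+n}^{-1}\mathbf{u}_s|^2}{\mathbf{v}_0^H\mathbf{R}_{i+n}^{-1}\mathbf{v}_0} \\
&= M\sigma_s^2\,\mathbf{u}_s^H\mathbf{R}_{i+n}^{-1}\mathbf{u}_s\;
  \frac{|\mathbf{v}_0^H\mathbf{R}_{i+n}^{-1}\mathbf{u}_s|^2}
       {(\mathbf{v}_0^H\mathbf{R}_{i+n}^{-1}\mathbf{v}_0)(\mathbf{u}_s^H\mathbf{R}_{i+n}^{-1}\mathbf{u}_s)} \\
&= \text{SINR}_o \cdot \cos^2(\mathbf{v}_0, \mathbf{u}_s; \mathbf{R}_{i+n}^{-1})
\end{aligned}
\qquad (11.4.9)
$$

where the term $\cos(\cdot)$ measures the cosine of a generalized angle between two vectors
$\mathbf{a}$ and $\mathbf{b}$ weighted by matrix $\mathbf{Z}$ (Cox 1973)

$$\cos^2(\mathbf{a}, \mathbf{b}; \mathbf{Z}) \triangleq
  \frac{|\mathbf{a}^H\mathbf{Z}\mathbf{b}|^2}{(\mathbf{a}^H\mathbf{Z}\mathbf{a})(\mathbf{b}^H\mathbf{Z}\mathbf{b})} \qquad (11.4.10)$$

This term can be shown to have limits of $0 \le \cos^2(\mathbf{a}, \mathbf{b}; \mathbf{Z}) \le 1$ through
the Schwarz inequality. The SINR from (11.4.9) can be rewritten as

$$\text{SINR}_1 = \text{SNR}_0 \cdot L_{sinr} \cdot L_{sm} \qquad (11.4.11)$$

where we define the signal mismatch (sm) loss to be

$$L_{sm} \triangleq \cos^2(\mathbf{v}_0, \mathbf{u}_s; \mathbf{R}_{i+n}^{-1}) \qquad (11.4.12)$$

Therefore, the SINR in (11.4.9) is a result of reducing the maximum SNR for a matched filter
by the SINR loss due to the interference $L_{sinr}$, as well as the loss due to the mismatch $L_{sm}$.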
To gain some insight into the loss due to mismatch, consider the eigendecomposition
of $\mathbf{R}_{i+n}^{-1}$ given by

$$\mathbf{R}_{i+n}^{-1} = \sum_{m=1}^{M}\frac{1}{\lambda_m}\,\mathbf{q}_m\mathbf{q}_m^H \qquad (11.4.13)$$

where $\lambda_m$ and $\mathbf{q}_m$ are the eigenvalue and eigenvector pairs, respectively. The largest
eigenvalues and their corresponding eigenvectors are due to interference, while the small
eigenvalues and eigenvectors are due to noise only. Since the eigenvectors form a basis for the
$M$-dimensional vector space, any vector, say, $\mathbf{v}_0$ or $\mathbf{u}_s$, can be written as a linear
combination of these eigenvectors. The product of the matrix $\mathbf{R}_{i+n}^{-1}$ with any vector
closely aligned with an interference eigenvector will suffer significant degradation. Therefore, the
mismatch loss in (11.4.12) should be relatively small for the case of $\mathbf{u}_s$ not closely aligned
with interferers. Otherwise, if the signal lies near any of the interference eigenvectors, the
beamformer will be more sensitive to signal mismatch.

Intuitively, the performance of the optimum beamformer is relatively insensitive to small
mismatches. The beamformer in (11.4.8) attempts to remove
any energy that is not contained in its unity-gain constraint for $\mathbf{v}_0$. Since the signal with
an array response $\mathbf{u}_s$ is not contained in the correlation matrix, the only losses incurred are
due to the degree of mismatch between $\mathbf{u}_s$ and $\mathbf{v}_0$ and the similarity of $\mathbf{u}_s$
to interference components that are nulled through the use of $\mathbf{R}_{i+n}^{-1}$. However, most
importantly, the loss due to mismatch is independent of the signal strength $\sigma_s^2$.
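The mismatch loss (11.4.12) and its independence of the signal strength can be illustrated with a
short sketch. The scenario is assumed here for illustration: a 16-element half-wavelength ULA, one
40-dB interferer at $30^\circ$, and a pointing error of $1^\circ$ as the only source of mismatch. The
helper `gen_cos2` is a hypothetical implementation of the generalized cosine in (11.4.10).

```python
import numpy as np

def steering(phi_deg, M, d_over_lambda=0.5):
    """Unit-norm ULA steering vector v(phi)."""
    m = np.arange(M)
    u = d_over_lambda * np.sin(np.deg2rad(phi_deg))
    return np.exp(-2j * np.pi * u * m) / np.sqrt(M)

def gen_cos2(a, b, Z):
    """Generalized cos^2(a, b; Z) from (11.4.10)."""
    num = abs(a.conj() @ Z @ b) ** 2
    den = np.real(a.conj() @ Z @ a) * np.real(b.conj() @ Z @ b)
    return float(num / den)

M = 16                                   # assumed array size
v_j = steering(30.0, M)                  # assumed interferer direction
R_in = np.eye(M) + 1e4 * np.outer(v_j, v_j.conj())
Rinv = np.linalg.inv(R_in)

v0 = steering(0.0, M)                    # assumed response (steering direction)
u_s = steering(1.0, M)                   # true response: 1-degree pointing error

L_sm = gen_cos2(v0, u_s, Rinv)           # mismatch loss (11.4.12)

# Mismatched MVDR weights (11.4.8) and their output interference-plus-noise power
c1 = Rinv @ v0 / np.real(v0.conj() @ Rinv @ v0)
noise_out = np.real(c1.conj() @ R_in @ c1)

def sinr_ratio(M_sigma_s2):
    """SINR_1 / SINR_o for a given array-level signal power M * sigma_s^2."""
    sinr_1 = M_sigma_s2 * abs(c1.conj() @ u_s) ** 2 / noise_out   # (11.4.9)
    sinr_o = M_sigma_s2 * np.real(u_s.conj() @ Rinv @ u_s)        # (11.4.7)
    return float(sinr_1 / sinr_o)

ratio_weak = sinr_ratio(1.0)      # 0-dB signal
ratio_strong = sinr_ratio(1e4)    # 40-dB signal

print("L_sm (dB)               :", 10 * np.log10(L_sm))
print("SINR_1/SINR_o, weak     :", ratio_weak)
print("SINR_1/SINR_o, strong   :", ratio_strong)
```

Both ratios equal $L_{sm}$, which confirms that, with the signal absent from the correlation matrix,
the loss depends only on the geometry of the mismatch and not on the signal power.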


Loss due to signal in the correlation matrix 


To implement the optimum beamformer in (11.4.6) in practice, we must assume that we
can estimate $\mathbf{R}_{i+n}$ without the presence of the signal $\mathbf{s}(n)$. However, in many
applications the signal is present all the time so that an estimate of a signal-free correlation matrix
is not possible. In this case, the optimum beamformer must be constructed with the correlation
matrix from (11.4.5) and is given by

$$\mathbf{c}_2 = \frac{\mathbf{R}_x^{-1}\mathbf{v}_0}{\mathbf{v}_0^H\mathbf{R}_x^{-1}\mathbf{v}_0} \qquad (11.4.14)$$

Although this beamformer differs from the beamformer $\mathbf{c}_1$ in (11.4.8) that does not include
the signal of interest in the correlation matrix, it produces an identical beamforming weight
vector in the case when it is perfectly matched to the signal of interest, that is,
$\mathbf{v}_0 = \mathbf{u}_s$ (see Problem 11.10). Thus, the beamformer in (11.4.14) also maximizes the
SINR in the case of a perfectly matched signal. However, we want to examine the sensitivity of this
beamformer to signal mismatches. The SINR of the beamformer from (11.4.14) with the
signal present (sp) in the correlation matrix can be shown to be (Cox 1973)

$$
\begin{aligned}
\text{SINR}_2 &= \frac{|\mathbf{c}_2^H\mathbf{s}(n)|^2}{\mathbf{c}_2^H\mathbf{R}_{i+n}\mathbf{c}_2}
  = M\sigma_s^2\,\frac{|\mathbf{v}_0^H\mathbf{R}_x^{-1}\mathbf{u}_s|^2}
       {\mathbf{v}_0^H\mathbf{R}_x^{-1}\mathbf{R}_{i+n}\mathbf{R}_x^{-1}\mathbf{v}_0} \\
&= \frac{\text{SINR}_1}{1 + (2\,\text{SINR}_o + \text{SINR}_o^2)\,\sin^2(\mathbf{v}_0, \mathbf{u}_s; \mathbf{R}_{i+n}^{-1})} \\
&= \text{SNR}_0 \cdot L_{sinr} \cdot L_{sm} \cdot L_{sp}
\end{aligned}
\qquad (11.4.15)
$$

where $\text{SINR}_1$ is the SINR of the mismatched beamformer in (11.4.9). The $\sin(\cdot)$ term
measures the sine of the generalized angle between $\mathbf{v}_0$ and $\mathbf{u}_s$ and is related to the
$\cos(\cdot)$ term from (11.4.10) by

$$\sin^2(\mathbf{v}_0, \mathbf{u}_s; \mathbf{R}_{i+n}^{-1}) = 1 - \cos^2(\mathbf{v}_0, \mathbf{u}_s; \mathbf{R}_{i+n}^{-1}) \qquad (11.4.16)$$

Thus, the SINR of a beamformer constructed with the signal of interest present in the
correlation matrix suffers an additional loss $L_{sp}$, beyond the losses associated with the
interference $L_{sinr}$ and the mismatch $L_{sm}$ between $\mathbf{u}_s$ and $\mathbf{v}_0$, which is given by

$$L_{sp} = \frac{1}{1 + (2\,\text{SINR}_o + \text{SINR}_o^2)\,\sin^2(\mathbf{v}_0, \mathbf{u}_s; \mathbf{R}_{i+n}^{-1})} \qquad (11.4.17)$$

Unlike the mismatch loss from (11.4.12), the loss due to the signal presence in the correlation
matrix with signal mismatch is related to the signal strength $\sigma_s^2$. In fact, (11.4.17) shows a
strong dependence on the signal strength through the terms $\text{SINR}_o$ and $\text{SINR}_o^2$ in the
denominator. This dependence on signal strength is weighted by the sine term in (11.4.16) that
measures the amount of mismatch. Thus for large signals, the losses can be significant. In fact, it
can be shown that the losses resulting from strong signals present in the correlation matrix
can cause the output SINR to be lower than if the signal had been relatively weak. This
phenomenon along with possible means of alleviating the losses is explored in Problem 11.11.

We have shown a high sensitivity to mismatch of strong signals of interest when they are
present in the correlation matrix used to compute the beamforming weights in (11.4.14).
Since, in practice, a certain level of mismatch is always present, it may sometimes be
advisable to use conventional, nonadaptive beamformers when the signal is present at
all times and does not allow the estimation of a signal-free correlation matrix $\mathbf{R}_{i+n}$. If
the performance of such nonadaptive beamformers is deemed unacceptable, then special
measures such as diagonal loading, which is described in Section 11.5.2, must be taken to
design a robust beamformer that is less sensitive to mismatch (Cox et al. 1987).
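The closed-form expression (11.4.15) and the counterintuitive effect of signal strength can be
checked numerically. The sketch below reuses an assumed scenario (16-element half-wavelength ULA, one
40-dB interferer at $30^\circ$, a $1^\circ$ pointing error); none of these values comes from the text,
and `sinr_signal_present` is a hypothetical helper. It evaluates $\text{SINR}_2$ both directly from
the weights in (11.4.14) and from the formula in (11.4.15).

```python
import numpy as np

def steering(phi_deg, M, d_over_lambda=0.5):
    """Unit-norm ULA steering vector v(phi)."""
    m = np.arange(M)
    u = d_over_lambda * np.sin(np.deg2rad(phi_deg))
    return np.exp(-2j * np.pi * u * m) / np.sqrt(M)

M = 16                                   # assumed array size
v_j = steering(30.0, M)                  # assumed interferer direction
R_in = np.eye(M) + 1e4 * np.outer(v_j, v_j.conj())
Rinv = np.linalg.inv(R_in)

v0 = steering(0.0, M)                    # assumed array response
u_s = steering(1.0, M)                   # true array response (1-degree error)

# Generalized cos^2 of (11.4.10) with Z = R_{i+n}^{-1}
cos2 = (abs(v0.conj() @ Rinv @ u_s) ** 2
        / (np.real(v0.conj() @ Rinv @ v0) * np.real(u_s.conj() @ Rinv @ u_s)))

def sinr_signal_present(M_sigma_s2):
    """Return SINR_2 computed directly and via the closed form (11.4.15)."""
    R_x = R_in + M_sigma_s2 * np.outer(u_s, u_s.conj())      # (11.4.5)
    w = np.linalg.solve(R_x, v0)
    c2 = w / (v0.conj() @ w)                                 # (11.4.14)
    direct = (M_sigma_s2 * abs(c2.conj() @ u_s) ** 2
              / np.real(c2.conj() @ R_in @ c2))
    sinr_o = M_sigma_s2 * np.real(u_s.conj() @ Rinv @ u_s)   # (11.4.7)
    sinr_1 = sinr_o * cos2                                   # (11.4.9)
    formula = sinr_1 / (1.0 + (2.0 * sinr_o + sinr_o ** 2) * (1.0 - cos2))
    return float(direct), float(formula)

weak_direct, weak_formula = sinr_signal_present(10.0)       # 10-dB signal
strong_direct, strong_formula = sinr_signal_present(1e4)    # 40-dB signal

print("SINR_2, 10-dB signal:", 10 * np.log10(weak_direct), "dB")
print("SINR_2, 40-dB signal:", 10 * np.log10(strong_direct), "dB")
```

In this assumed scenario the 40-dB signal ends up with a lower output SINR than the 10-dB signal,
which is the behavior described in the text for strong signals present in the correlation matrix.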


11.4.2 Effect of Bandwidth 


So far, we have relied on the narrowband assumption, meaning that the bandwidth $B$ of
the received signals is small with respect to the carrier frequency $F_c$. Previously, we gave
a rule of thumb for this assumption, namely, that the fractional bandwidth, defined as

$$\bar{B} \triangleq \frac{B}{F_c} \qquad (11.4.18)$$

is small, say, $\bar{B} < 1$ percent. Another measure is the space-time-bandwidth product, which
for an array of length $L$ is

$$\text{TBWP} = \frac{LB}{c} \qquad (11.4.19)$$

where the time $L/c$ is the maximum amount of time for a plane wave to propagate across
the entire array, that is, the maximum propagation delay between the first and last elements
($\phi = \pm 90^\circ$, $|\sin\phi| = 1$).

However, many real-world applications require increased bandwidths, which cause
this assumption to be violated (Buckley 1987; Zatman 1998). The question then is, what
is the effect of bandwidth on the performance of an array? Let us begin by examining the
narrowband steering vector for a ULA from (11.1.19)

$$\mathbf{v}(\phi) = \frac{1}{\sqrt{M}}\left[1\ \ e^{-j2\pi[(d\sin\phi)/\lambda]}\ \cdots\ e^{-j2\pi[(d\sin\phi)/\lambda](M-1)}\right]^T \qquad (11.4.20)$$

which assumes that $\lambda$ is constant, that is, the array receives signals only from a frequency
$F_c$. Relaxing this assumption and substituting $\lambda = c/F$ from (11.1.2) gives us a steering
vector that makes no assumptions about the bandwidth of the incoming signal

$$\mathbf{v}(\phi, F) = \frac{1}{\sqrt{M}}\left[1\ \ e^{-j2\pi[(d\sin\phi)/c]F}\ \cdots\ e^{-j2\pi[(d\sin\phi)/c](M-1)F}\right]^T \qquad (11.4.21)$$

When we demodulate the received signals by the carrier frequency $F_c$, we are making an
implicit narrowband assumption that allows us to model the time delay between sensor
elements as a phase shift. Therefore, a wideband signal arriving from an angle $\phi$ appears to
the narrowband receiver as if it were arriving from an angular region centered at $\phi$ (provided
the spectrum of the incoming signal is centered about $F_c$), since the approximation of the
delay between elements as a single phase shift no longer holds. This phenomenon is known
as dispersion since the incoming wideband signal appears to disperse in angle across the
array.

Let us examine the impact of a wideband interference signal on the performance of 
an adaptive array. The correlation matrix of a single interference source impinging on the 
array from an angle ¢ is found by integrating over the bandwidth of the received signal 


Fo+B/2 


o2 
R= > / v(¢, F)v" (6, F) dF (11.4.22) 


F._B/2 
where the assumption is made that the spectral response of the signal is flat over the band- 
width, that is, |R(F)|? = 1|for F,—B/2 < F < F,+ B/2. Now, focusing on the individual 
elements of the correlation matrix, namely the (m, n)th element 


Fe+B/2 
eJ2xm{(d sin $)/clF —j2xnl(d sin d)/clF gp 


beets 


(Ri) mn = 
F.—B/2 
Fo+B/2 
ei2a(m—ni(d sin b)/c] F dF 


byob 


BocHie (1.4.23) 


anes] 
2 sin | 277 (m — n) = 
€ 


= o2ei2rim—mU(dsind)/c1F. 2 


name 
Cc 


a o2ei2mm—ml(a sin $)/Al ging lon - ne sing B| 
c 


where sinc(x) = sin(zx)/(2x). We notice that each element is made up of two terms. The 
first term is simply the cross-correlation between the mth and nth sensor array elements for 
a narrowband signal arriving from @ 


(Ri?) Vien =— Geary sin )/d) (11.4.24) 
where the superscript indicates that this is the narrowband correlation matrix. The second 
term represents the dispersion across the array caused by the bandwidth of the interferer 


and is given by 
i dsin d 
(Ra)m.n = sinc | (m —n)———B (11.4.25) 
c 


Using (11.4.25), we can construct a matrix that models this dispersion across the entire array, 
which we refer to as the dispersion matrix. Dispersion creates decorrelation of the signal 
across the array, and this term represents the temporal autocorrelation of the impinging 
signal. Therefore, we can write the wideband correlation matrix as the Hadamard product 
of the narrowband correlation and the dispersion matrices 


$$\mathbf{R}_i = \mathbf{R}_i^{(nb)} \odot \mathbf{R}_d \tag{11.4.26}$$


where the Hadamard product is a point-by-point multiplication (Strang 1998). 
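
The element-wise structure of (11.4.24) through (11.4.26) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, the interference power is set to one, and the element spacing is taken as half a wavelength at $F_c$, so that $(d \sin\phi / c)B = (d/\lambda)\sin\phi \cdot B/F_c$.

```python
import cmath
import math


def sinc(x):
    """sinc(x) = sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def narrowband_corr(M, phi_deg, d_over_lambda=0.5, power=1.0):
    """(11.4.24): power * exp(j 2 pi (m - n) (d sin phi) / lambda)."""
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    return [[power * cmath.exp(2j * math.pi * (m - n) * u) for n in range(M)]
            for m in range(M)]


def dispersion(M, phi_deg, frac_bw, d_over_lambda=0.5):
    """(11.4.25): sinc((m - n) (d sin phi / c) B), with frac_bw = B / Fc."""
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    return [[sinc((m - n) * u * frac_bw) for n in range(M)] for m in range(M)]


def wideband_corr(M, phi_deg, frac_bw):
    """(11.4.26): element-by-element (Hadamard) product of the two matrices."""
    Rnb = narrowband_corr(M, phi_deg)
    Rd = dispersion(M, phi_deg, frac_bw)
    return [[Rnb[m][n] * Rd[m][n] for n in range(M)] for m in range(M)]


# Assumed parameters: M = 10 elements, phi = 30 degrees, B / Fc = 1 percent
R = wideband_corr(10, 30.0, 0.01)
Rd = dispersion(10, 30.0, 0.01)
Rd_broadside = dispersion(10, 0.0, 0.01)
print(Rd[0][9])            # decorrelation between the two end elements
print(Rd_broadside[0][9])  # broadside arrival: the sinc argument is zero
```

The dispersion matrix has a unit diagonal, so the Hadamard product leaves the per-element power unchanged and only reduces the correlation between elements that are far apart.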

The dispersion produced by a wideband signal can be compensated or corrected for at its
specific angle by using a technique known as time-delay steering. Notice that the dispersion
term in (11.4.25) is $(\mathbf{R}_d)_{m,n} = 1$ for $\phi = 0^\circ$ since the argument of the sinc function is
zero. Therefore, for signals arriving from $\phi = 0^\circ$ no dispersion can occur. In other words,
for signals arriving from broadside to the array ($\phi = 0^\circ$), the delay between elements is
zero, independent of the frequency of the signal or its bandwidth. The dispersion becomes
worse as the value of the angle is increased. This suggests a simple remedy to correct
for dispersion: refocus the array to angle $\phi$. Steering the array in this direction involves
time-delaying each element to compensate explicitly for the delay between elements. This
time-delay steering can be implemented in analog or digitally and is illustrated in Figure
11.20. The time-delay steered array signal is


$$\mathbf{x}_{td}(t) = [x_1(t) \;\; x_2(t - \tau_2) \;\; \cdots \;\; x_M(t - \tau_M)]^T \tag{11.4.27}$$

where $\tau_m = (d/\lambda)(m - 1) \sin\phi$ is the delay of the $m$th element expressed in periods of the
center frequency $F_c$, with $\lambda$ being the wavelength at $F_c$ (in seconds, the delay is
$(d/c)(m - 1) \sin\phi$). Thus, a signal arriving from this angle will have no delay between elements
following the time-delay steering. A convenient means of modeling time-delay steering in the
discrete-time signal is through application of the matrix

$$\mathbf{V} = \mathrm{diag}\{\mathbf{v}(\phi)\} \tag{11.4.28}$$

to the array signal $\mathbf{x}(n)$ as

$$\mathbf{x}_{td}(n) = \mathbf{V}^H \mathbf{x}(n) \tag{11.4.29}$$

The resulting interference correlation matrix is

$$\mathbf{R}_i^{(td)} = \mathbf{V}^H \mathbf{R}_i \mathbf{V} \tag{11.4.30}$$

FIGURE 11.20
Time-delay steering prior to beamforming (referenced to the first element,
$\tau_1 = 0$).


Time-delay steering will focus signals from angle $\phi$ but may in fact increase the amount
of dispersion from other angles. However, if we are not looking at these other angles, this 
effect may not be noticed. The underlying phenomenon that is occurring is that an optimum 
beamformer is forced to use additional adaptive degrees of freedom to cancel the dispersed, 
wideband interference signals. As long as the optimum beamformer has sufficient degrees 
of freedom, the effect of dispersion at other angles may not be evident. 


EXAMPLE 11.4.1. Consider the radar interference scenario with a single jammer at an angle
$\phi = 30^\circ$ with a jammer-to-noise ratio JNR = 50 dB. Again, we have an $M = 10$ element array
with $\lambda/2$ spacing. The center frequency of the array is $F_c = 1$ GHz, and the bandwidth is $B = 10$
MHz for a fractional bandwidth of $B/F_c = 1$ percent. The SINR loss of an optimum beamformer
is found by substituting the wideband correlation matrix from (11.4.26) into the SINR loss in
(11.3.18)

$$L_{sinr}(\phi_s) = \mathbf{v}^H(\phi_s)\, [\mathbf{R}_{i+n}^{(wb)}]^{-1}\, \mathbf{v}(\phi_s) \tag{11.4.31}$$

where the wideband interference-plus-noise correlation matrix is
$\mathbf{R}_{i+n}^{(wb)} = \mathbf{R}_i + \sigma_w^2 \mathbf{I}$ since the
thermal noise is uncorrelated. Scanning across all angles, we can compute the SINR loss, which
is shown in Figure 11.21 along with the SINR loss had the signal been narrowband. Notice the
increased width of the SINR loss notch centered about $\phi = 30^\circ$, which corresponds to a dropoff
in performance in the vicinity of the jammer with respect to the narrowband case. However,
at angles farther from the jammer there is no impact on performance; that is, $L_{sinr}(\phi_s) \approx 0$
dB. Next we look at the performance of an optimum beamformer that incorporates time-delay
steering prior to adaptation. In this case, using $\mathbf{R}_i^{(td)} + \sigma_w^2 \mathbf{I}$ from (11.4.30) in place of
$\mathbf{R}_{i+n}^{(wb)}$ in the SINR loss equation, we can compute the SINR loss of the optimum beamformer
using time-delay steering, which is also plotted in Figure 11.21. The notch around the jammer at
$\phi = 30^\circ$ has been restored to the narrowband case for angles immediately surrounding
$\phi = 30^\circ$. At the angles a little farther away, the performance is still worse than that for the
narrowband case but still significantly better than that without time-delay steering.

FIGURE 11.21
SINR loss for wideband jammer with JNR = 50 dB at an angle of
$\phi = 30^\circ$. The carrier frequency is $F_c = 1$ GHz, and the bandwidth
is $B = 10$ MHz. Solid line is the narrowband signal, dashed line is
the wideband signal, and dash-dot line is the wideband signal with
time-delay steering. (Axes: SINR loss in dB versus angle in degrees.)


11.5 ADAPTIVE BEAMFORMING 


So far, we have only considered the optimum beamformer but have not concerned our- 
selves with how such a beamformer would be implemented in practice. Optimality was 
only achieved because we assumed perfect knowledge of the second-order statistics of the 
interference at the array, that is, the interference-plus-noise correlation matrix $\mathbf{R}_{i+n}$. In this
section, we describe the use of adaptive methods that are based on collected data from 
which the correlation matrix is estimated. We look at two types of methods: block adaptive 
and sample-by-sample adaptive. A block adaptive implementation of the optimum beam- 
former uses a “block” of data to estimate the adaptive beamforming weight vector and is 
known as sample matrix inversion (SMI). The SMI adaptive beamformer is examined in 
Section 11.5.1 along with the sidelobe levels and training issues associated with the SMI 
adaptive beamformer. Next we introduce the use of diagonal loading within the context
of the block adaptive SMI beamformer in Section 11.5.2 and discuss the implementation of
the SMI beamformer in Section 11.5.3. In Section 11.5.4, we discuss sample-by-sample
adaptive methods. These methods, like the block adaptive methods, base their estimates of
the statistics on the data, but they update these statistics with each new sample and are
extensions of the adaptive filtering techniques from Chapter 10 to array processing.


11.5.1 Sample Matrix Inversion 


In practice, the correlations are unknown and must be estimated from the data. Thus, we turn
to the maximum-likelihood (ML) estimate of the correlation matrix given by the average
of outer products of the array snapshots (Goodman 1963)

$$\hat{\mathbf{R}}_{i+n} = \frac{1}{K} \sum_{k=1}^{K} \mathbf{x}_{i+n}(n_k)\, \mathbf{x}_{i+n}^H(n_k) \tag{11.5.1}$$

where the indices $n_k$ define the $K$ samples of $\mathbf{x}_{i+n}(n)$ for $1 \le n \le N$ that make up the
training set. Many applications may dictate that the collected snapshots be split into training
data and data to be processed. The ML estimate of the correlation matrix implies that as
$K \to \infty$, then $\hat{\mathbf{R}}_{i+n} \to \mathbf{R}_{i+n}$; and it is known as the sample correlation matrix. The
total number of snapshots $K$ used to compute the sample correlation matrix is referred to
as the sample support. The larger the sample support, the better the estimate $\hat{\mathbf{R}}_{i+n}$ of the
correlation matrix for stationary data. Proceeding by substituting the sample correlation
matrix from (11.5.1) into the optimum beamformer weight computation in (11.3.15) results
in the adaptive beamformer (Reed et al. 1974)


$$\mathbf{c}_{smi} = \frac{\hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)}{\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)} \tag{11.5.2}$$

known as the sample matrix inversion adaptive beamformer.† As for the optimum beamformer,
an SMI adaptive beamformer can be implemented with low sidelobe control through
the use of tapers. Simply substitute a tapered steering vector from (11.3.20) for $\mathbf{v}(\phi_s)$ in
(11.5.2)

$$\mathbf{c}_{tsmi} = \frac{\hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}_t(\phi_s)}{\mathbf{v}_t^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}_t(\phi_s)} \tag{11.5.3}$$

Similarly, all the adaptive processing methods that will be discussed in Section 11.6, that
is, the linearly constrained beamformer, all the partially adaptive beamformers, and the
sidelobe canceler, can be implemented in a similar fashion by substituting the appropriate
sample correlation matrix for its theoretical counterpart.

†An adaptive beamformer that is very similar to the SMI adaptive beamformer is known as the adaptive matched
filter (AMF) (Robey et al. 1992). The difference between the two is actually in the normalization. The AMF
requires $\mathbf{c}^H \hat{\mathbf{R}}_{i+n} \mathbf{c} = 1$ rather than $\mathbf{c}^H \mathbf{v}(\phi_s) = 1$ so that the interference-plus-noise has unit power at the
beamformer output. As a result, it is straightforward to choose a detection threshold for the output of an AMF
beamformer. For this reason, this method is discussed primarily within the context of adaptive detection. It is
straightforward to show that the relation between the AMF and SMI adaptive weights, as they are defined in
(11.5.2), is $\mathbf{c}_{amf} = [\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)]^{1/2}\, \mathbf{c}_{smi}$.

Of course, we cannot expect to substitute an estimate $\hat{\mathbf{R}}_{i+n}$ of the true correlation
matrix $\mathbf{R}_{i+n}$ into the adaptive weight equation without experiencing a loss in performance.
We begin by computing the output SINR of the SMI adaptive beamformer

$$\mathrm{SINR}_{smi} = \frac{M\sigma_s^2\, |\mathbf{c}_{smi}^H \mathbf{v}(\phi_s)|^2}{E\{|\mathbf{c}_{smi}^H \mathbf{x}_{i+n}|^2\}} = \frac{M\sigma_s^2\, |\mathbf{c}_{smi}^H \mathbf{v}(\phi_s)|^2}{\mathbf{c}_{smi}^H \mathbf{R}_{i+n} \mathbf{c}_{smi}} = \frac{M\sigma_s^2\, [\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)]^2}{\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{R}_{i+n} \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)} \tag{11.5.4}$$

Comparing this to the SINR obtained with the optimum beamformer from (11.3.12), we
obtain the loss associated with the SMI adaptive beamformer relative to the optimum
beamformer

$$L_{smi} = \frac{\mathrm{SINR}_{smi}}{\mathrm{SINR}_o} = \frac{[\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)]^2}{[\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{R}_{i+n} \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)]\,[\mathbf{v}^H(\phi_s)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_s)]} \tag{11.5.5}$$

This SMI loss is dependent on the array data used to compute $\hat{\mathbf{R}}_{i+n}$, which implies that
$L_{smi}$, like the data, is a random variable. In fact, it can be shown that $L_{smi}$ follows a beta
distribution given by (Reed et al. 1974)

$$p_\beta(L_{smi}) = \frac{K!}{(M-2)!\,(K+1-M)!}\, (1 - L_{smi})^{M-2}\, (L_{smi})^{K+1-M} \tag{11.5.6}$$

assuming a complex Gaussian distribution for the sensor thermal noise and the interference
signals. Here $M$ is the number of sensors in the array, and $K$ is the number of snapshots
used to estimate $\mathbf{R}_{i+n}$. Taking the expectation of this loss yields

$$E\{L_{smi}\} = \frac{K + 2 - M}{K + 1} \tag{11.5.7}$$

which can be used to determine the sample support required to limit the losses due to
correlation matrix estimation to a level considered acceptable. From (11.5.7), we can deduce
the SMI loss will be approximately $-3$ dB for $K = 2M$ and approximately $-1$ dB for
$K = 5M$.
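
As a rough numerical companion to (11.5.6) and (11.5.7), the following Python sketch evaluates the expected loss for a few sample supports and cross-checks it against a numerical integration of the beta density. The function names, and in particular the helper `min_snapshots` that inverts (11.5.7), are our own illustrative choices and not part of the original development.

```python
import math


def smi_loss_pdf(L, M, K):
    """Beta density (11.5.6) of the SMI loss for M sensors and K snapshots."""
    log_c = math.lgamma(K + 1) - math.lgamma(M - 1) - math.lgamma(K + 2 - M)
    return math.exp(log_c + (M - 2) * math.log(1.0 - L)
                    + (K + 1 - M) * math.log(L))


def expected_smi_loss(M, K):
    """Expected SMI loss (11.5.7)."""
    return (K + 2 - M) / (K + 1)


def numerical_mean(M, K, n=20000):
    """Midpoint-rule estimate of E{L_smi} from the density, as a cross-check."""
    h = 1.0 / n
    return sum((i + 0.5) * h * smi_loss_pdf((i + 0.5) * h, M, K)
               for i in range(n)) * h


def min_snapshots(M, loss_db):
    """Smallest K whose expected loss is no worse than loss_db (negative dB)."""
    rho = 10.0 ** (loss_db / 10.0)
    return math.ceil((M - 2 + rho) / (1.0 - rho))


M = 20
for K in (30, 40, 100):  # K = 1.5M, 2M, and 5M
    loss_db = 10 * math.log10(expected_smi_loss(M, K))
    print(K, round(loss_db, 2), round(numerical_mean(M, K), 4))
```

For finite $M$ the exact values differ slightly from the $-3$ dB and $-1$ dB rules of thumb, which become tighter as $M$ grows.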


EXAMPLE 11.5.1. In this example, we study the SMI adaptive beamformer and the loss associated
with the number of snapshots used for training. SMI adaptive beamformers are produced
with sample supports of $K = 1.5M$, $2M$, and $5M$. Consider a ULA with $M = 20$ elements with
an interference source at $\phi_i = 20^\circ$ and a power of 50 dB. The thermal noise has unit variance
$\sigma_w^2 = 1$. We can generate the interference-plus-noise signal $\mathbf{x}_{i+n}$ as

    % steering vector of the interferer
    v_i = exp(-j*pi*[0:M-1]'*sin(phi_i*pi/180))/sqrt(M);
    % interference plus spatially white, unit-variance noise (M x N)
    x_ipn = (10^(40/20))*v_i*(randn(1,N)+j*randn(1,N))/sqrt(2) + ...
            (randn(M,N)+j*randn(M,N))/sqrt(2);

The sample correlation matrix is then found from (11.5.1). We compute the SINR at an angle of
$\phi$ by first computing the SMI adaptive weight vector from (11.5.2), and the SINR from (11.5.4)
using the actual correlation matrix $\mathbf{R}_{i+n}$, computed by

    R_ipn = (10^(40/10))*v_i*v_i' + eye(M);

and a signal of interest with $M\sigma_s^2 = 1$. We repeat this across all angles $-90^\circ \le \phi < 90^\circ$
and average over 100 realizations of $\mathbf{x}_{i+n}$. The resulting average SINR for the various sample
supports is shown in Figure 11.22 along with the SINR for the optimum beamformer computed
from (11.3.12). Note that for a signal of interest with $M\sigma_s^2 = 1$ and unit-variance noise, the
SINR of the optimum beamformer is equal to its SINR loss. The jammer null is at $\phi = 20^\circ$,
as expected for all the beamformers. However, we notice that the SINR of the SMI adaptive
beamformers is less than the optimum beamformer SINR by approximately 4, 3, and 1 dB for
the sample supports of $K = 1.5M$, $2M$, and $5M$, respectively. These losses are consistent with
the SMI loss predicted by (11.5.7).


Sidelobe levels of the SMI adaptive beamformer 


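In addition to affecting the SINR of the beamformer output, the use of array snapshots to
estimate $\mathbf{R}_{i+n}$ has implications for the sidelobe levels of the resulting adaptive beamformer.
The following analysis follows directly from Kelly (1989). Consider a signal received from
a direction other than the direction of look $\phi_s$. The response to such a signal determines the
sidelobe level of the adaptive beamformer at this angle. For the MVDR optimum beamformer
from (11.3.15), the sidelobe level (SLL) at an angle $\phi_u$ is given by

$$\mathrm{SLL}_o = |C_o(\phi_u)|^2 = \frac{|\mathbf{v}^H(\phi_s)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_u)|^2}{|\mathbf{v}^H(\phi_s)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_s)|^2} \tag{11.5.8}$$

where $\phi_s$ is the beamformer steering angle or look direction. Likewise, we can also define
the SINR of a signal $\mathbf{s}(n) = \sigma_u \mathbf{v}(\phi_u)$ received from an angle $\phi_u$ in the sidelobes of the
optimum beamformer steered to $\phi_s$

$$\begin{aligned}
\mathrm{SINR}_o(\phi_s, \phi_u) &= \frac{\sigma_u^2\, |\mathbf{c}_o^H(\phi_s) \mathbf{v}(\phi_u)|^2}{E\{|\mathbf{c}_o^H(\phi_s)\, \mathbf{x}_{i+n}(n)|^2\}} = \frac{\sigma_u^2\, |\mathbf{v}^H(\phi_s)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_u)|^2}{\mathbf{v}^H(\phi_s)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_s)} \\
&= \mathrm{SINR}_o(\phi_u, \phi_u)\, \frac{|\mathbf{v}^H(\phi_s)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_u)|^2}{[\mathbf{v}^H(\phi_s)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_s)]\,[\mathbf{v}^H(\phi_u)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_u)]} \\
&= \mathrm{SINR}_o(\phi_u, \phi_u)\, \cos^2(\mathbf{v}(\phi_s), \mathbf{v}(\phi_u); \mathbf{R}_{i+n}^{-1})
\end{aligned} \tag{11.5.9}$$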
FIGURE 11.22

SINR loss for SMI adaptive beamformer with different numbers of 
training snapshots. Thin solid line has 30 snapshots (K = 1.5M), 
dashed line has 40 snapshots (K = 2M), and dash-dot line has 100 
snapshots (K = 5M). Thick, solid line is SINR loss for the 
optimum beamformer. 


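since $\mathrm{SINR}_o(\phi_u, \phi_u) = \sigma_u^2\, \mathbf{v}^H(\phi_u)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_u)$, which is the maximum output SINR possible
for a signal at angle $\phi_u$, that is, the SINR if the optimum beamformer had been properly
steered in this direction. The term

$$\begin{aligned}
\cos(\mathbf{v}(\phi_s), \mathbf{v}(\phi_u); \mathbf{R}_{i+n}^{-1}) &= \frac{|\mathbf{v}^H(\phi_s)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_u)|}{[\mathbf{v}^H(\phi_s)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_s)]^{1/2}\,[\mathbf{v}^H(\phi_u)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_u)]^{1/2}} \\
&= \frac{|\tilde{\mathbf{v}}^H(\phi_s)\, \tilde{\mathbf{v}}(\phi_u)|}{[\tilde{\mathbf{v}}^H(\phi_s)\, \tilde{\mathbf{v}}(\phi_s)]^{1/2}\,[\tilde{\mathbf{v}}^H(\phi_u)\, \tilde{\mathbf{v}}(\phi_u)]^{1/2}}
\end{aligned} \tag{11.5.10}$$

where

$$\tilde{\mathbf{v}}(\phi) = \mathbf{L}_{i+n}^{-1} \mathbf{v}(\phi) \tag{11.5.11}$$

measures the cosine of a generalized angle between vectors $\mathbf{v}(\phi_s)$ and $\mathbf{v}(\phi_u)$ (Cox 1973).
This last quantity is the cosine of the angle between the whitened vectors $\tilde{\mathbf{v}}(\phi_s)$ and $\tilde{\mathbf{v}}(\phi_u)$
at the respective angles of $\phi_s$ and $\phi_u$. The matrix $\mathbf{L}_{i+n}$ is simply the Cholesky factor of the
correlation matrix, that is, $\mathbf{R}_{i+n} = \mathbf{L}_{i+n} \mathbf{L}_{i+n}^H$. The sidelobe level of the optimum beamformer
from (11.5.8) can also be written in terms of the SINR from (11.5.9)

$$\mathrm{SLL}_o = |C_o(\phi_u)|^2 = \frac{\mathrm{SINR}_o(\phi_s, \phi_u)}{\mathrm{SINR}_o(\phi_s, \phi_s)} \tag{11.5.12}$$

From (11.5.9),

$$\cos^2(\mathbf{v}(\phi_s), \mathbf{v}(\phi_u); \mathbf{R}_{i+n}^{-1}) = \frac{\mathrm{SINR}_o(\phi_s, \phi_u)}{\mathrm{SINR}_o(\phi_u, \phi_u)} \tag{11.5.13}$$

which is not the same as the sidelobe level in (11.5.12). However, this cosine term is
a measure of the attenuation provided by an optimum beamformer steered in the direction
$\phi_s$ as opposed to the maximum SINR provided by steering to angle $\phi_u$. Thus,
$\cos^2(\mathbf{v}(\phi_s), \mathbf{v}(\phi_u); \mathbf{R}_{i+n}^{-1})$ can be thought of as the sidelobe level at an angle $\phi_u$ of an
optimum beamformer steered to $\phi_s$ in the absence of interference at $\phi_u$. As a result, this term
serves as an upper bound on the sidelobe level.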
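
As a quick sanity check (our own illustration, assuming unit-norm steering vectors and no interference, so that $\mathbf{R}_{i+n} = \sigma_w^2 \mathbf{I}$), the generalized cosine of (11.5.10) reduces to the ordinary cosine between the two steering vectors, and the sidelobe level of (11.5.8) coincides with it:

$$\cos^2(\mathbf{v}(\phi_s), \mathbf{v}(\phi_u); \sigma_w^{-2}\mathbf{I}) = \frac{|\mathbf{v}^H(\phi_s) \mathbf{v}(\phi_u)|^2}{[\mathbf{v}^H(\phi_s) \mathbf{v}(\phi_s)]\,[\mathbf{v}^H(\phi_u) \mathbf{v}(\phi_u)]} = |\mathbf{v}^H(\phi_s) \mathbf{v}(\phi_u)|^2 = \mathrm{SLL}_o$$

In this special case the bound is met with equality and is simply the quiescent beampattern evaluated at $\phi_u$.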


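Turning our attention to the SMI adaptive beamformer, we begin by computing the
SINR of a signal received from an angle $\phi_u$ in the sidelobes of the SMI adaptive beamformer
steered to $\phi_s$

$$\begin{aligned}
\mathrm{SINR}_{smi}(\phi_s, \phi_u) &= \frac{\sigma_u^2\, |\mathbf{c}_{smi}^H(\phi_s) \mathbf{v}(\phi_u)|^2}{E\{|\mathbf{c}_{smi}^H(\phi_s)\, \mathbf{x}_{i+n}(n)|^2\}} = \frac{\sigma_u^2\, |\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_u)|^2}{\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{R}_{i+n} \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)} \\
&= \mathrm{SINR}_o(\phi_u, \phi_u)\, \frac{|\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_u)|^2}{[\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{R}_{i+n} \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)]\,[\mathbf{v}^H(\phi_u)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_u)]} \\
&= \mathrm{SINR}_o(\phi_u, \phi_u)\, L(\phi_s, \phi_u)
\end{aligned} \tag{11.5.14}$$

where

$$L(\phi_s, \phi_u) = \frac{\mathrm{SINR}_{smi}(\phi_s, \phi_u)}{\mathrm{SINR}_o(\phi_u, \phi_u)} = \frac{|\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_u)|^2}{[\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{R}_{i+n} \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)]\,[\mathbf{v}^H(\phi_u)\, \mathbf{R}_{i+n}^{-1} \mathbf{v}(\phi_u)]} \tag{11.5.15}$$
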

This term is bounded by $0 < L(\phi_s, \phi_u) < 1$ and can be interpreted as the loss of a signal
received from the sidelobe angle $\phi_u$ processed with an SMI adaptive beamformer steered to
$\phi_s$ relative to the maximum SINR possible for this signal. The term in the denominator of
(11.5.15) is the SINR of the optimum, not the SMI adaptive beamformer. It is evident that as
the number of array snapshots $K \to \infty$, $\hat{\mathbf{R}}_{i+n} \to \mathbf{R}_{i+n}$ and
$L(\phi_s, \phi_u) \to \cos^2(\mathbf{v}(\phi_s), \mathbf{v}(\phi_u); \mathbf{R}_{i+n}^{-1})$ from (11.5.15). The sidelobe level, however,
of the SMI adaptive beamformer is

$$\mathrm{SLL}_{smi} = |C_{smi}(\phi_u)|^2 = \frac{|\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_u)|^2}{|\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)|^2} \tag{11.5.16}$$

However, unlike the sidelobe level of the optimum beamformer in (11.5.12) which could
be related to the SINR of signals in the sidelobes, such a relation does not hold for the SMI
adaptive beamformer because

$$\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{R}_{i+n} \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s) \ne \mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s) \tag{11.5.17}$$

Asymptotically, this relation holds, but for finite sample support it does not. Nonetheless, we
can draw some conclusions about the anticipated sidelobe levels using $L(\phi_s, \phi_u)$. The loss
in SINR of the sidelobe signal $L(\phi_s, \phi_u)$ is a random variable with a probability distribution
(Boroson 1980)


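$$p(L) = \sum_{j=0}^{J} \binom{J}{j} [\cos^2(\mathbf{v}(\phi_s), \mathbf{v}(\phi_u); \mathbf{R}_{i+n}^{-1})]^{J-j}\, [\sin^2(\mathbf{v}(\phi_s), \mathbf{v}(\phi_u); \mathbf{R}_{i+n}^{-1})]^{j}\; p_\beta(L, J+1-j, M-1+j) \tag{11.5.18}$$

where $\sin^2(\mathbf{v}(\phi_s), \mathbf{v}(\phi_u); \mathbf{R}_{i+n}^{-1}) = 1 - \cos^2(\mathbf{v}(\phi_s), \mathbf{v}(\phi_u); \mathbf{R}_{i+n}^{-1})$. Recall that
$\cos^2(\mathbf{v}(\phi_s), \mathbf{v}(\phi_u); \mathbf{R}_{i+n}^{-1})$ depends on the true correlation matrix. The term $J$ is given by

$$J = K + 1 - M \tag{11.5.19}$$

and $p_\beta(x, l, m)$ is the beta probability distribution given by

$$p_\beta(x, l, m) = \frac{(l + m - 1)!}{(l - 1)!\,(m - 1)!}\, x^{l-1} (1 - x)^{m-1} \tag{11.5.20}$$
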
From this probability distribution, we can compute the expected value of the loss of a signal
received in the sidelobes of the SMI adaptive beamformer

$$E\{L(\phi_s, \phi_u)\} = \frac{1}{K+1} \left[1 + (K + 1 - M) \cos^2(\mathbf{v}(\phi_s), \mathbf{v}(\phi_u); \mathbf{R}_{i+n}^{-1})\right] \tag{11.5.21}$$

For the case of perfect alignment ($\phi_u = \phi_s$), equation (11.5.21) measures the loss in SINR
in the look direction since $\cos^2(\cdot) = 1$ and $E\{L(\phi_s, \phi_s)\} = E\{L_{smi}\} = (K + 2 - M)/(K + 1)$,
which is the standard SMI SINR loss of (11.5.7). In the opposite extreme, if $\phi_u$ is the
angle of a null in the corresponding optimum beamformer, then $\cos^2(\cdot) = 0$ and

$$E\{L(\phi_s, \phi_u)\} = \frac{1}{K+1} \tag{11.5.22}$$

The expected value of this loss can be interpreted as a bound on the sidelobe level provided
that no interference sources were present at angle $\phi_u$. The implication of this equation
is a lower bound on the sidelobe level that can be achieved by using an SMI adaptive
beamformer. Note that all this analysis also applies for tapered SMI adaptive beamformers
when we substitute $\mathbf{v}_t(\phi_s) \to \mathbf{v}(\phi_s)$ as we did for the weights in (11.5.3). As a rule of thumb,
we can use (11.5.22) to determine the sample support required for the desired sidelobe level.
For example, if we were to design an adaptive beamformer with $-40$-dB sidelobe levels,
we would require on the order of $K = 10{,}000$ snapshots.
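
The rule of thumb can be turned into a small calculator. The Python sketch below evaluates (11.5.21) and (11.5.22) and inverts (11.5.22) for a target sidelobe level; the function names are hypothetical and the inversion is simply $K \approx 10^{-\mathrm{SLL}/10} - 1$.

```python
import math


def expected_sidelobe_loss(M, K, cos2):
    """(11.5.21): E{L} = [1 + (K + 1 - M) cos^2] / (K + 1)."""
    return (1.0 + (K + 1 - M) * cos2) / (K + 1)


def sidelobe_floor_db(K):
    """(11.5.22) in dB: expected level at a null of the optimum beamformer."""
    return -10.0 * math.log10(K + 1)


def snapshots_for_sidelobes(sll_db):
    """Rule-of-thumb sample support for a desired sidelobe level (negative dB)."""
    return math.ceil(10.0 ** (-sll_db / 10.0) - 1.0)


for K in (100, 1000):
    print(K, round(sidelobe_floor_db(K), 1))
print(snapshots_for_sidelobes(-40.0))
```

Setting `cos2` to 1 in `expected_sidelobe_loss` recovers the look-direction loss of (11.5.7), and setting it to 0 recovers (11.5.22).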


EXAMPLE 11.5.2. We want to explore the effect of the number of training samples on the sidelobe
levels of the SMI adaptive beamformer. To this end, we generate an interference signal at
$\phi_i = 10^\circ$ with a power of 70 dB and noise with unit variance ($\sigma_w^2 = 1$) for a ULA with $M = 40$
elements. The interference-plus-noise signal $\mathbf{x}_{i+n}$ is generated by

    % steering vector of the interferer
    v_i = exp(-j*pi*[0:M-1]'*sin(phi_i*pi/180))/sqrt(M);
    % interference plus spatially white, unit-variance noise (M x N)
    x_ipn = (10^(70/20))*v_i*(randn(1,N)+j*randn(1,N))/sqrt(2) + ...
            (randn(M,N)+j*randn(M,N))/sqrt(2);

The sample correlation matrix is computed using (11.5.1). Then the SMI adaptive beamformer
weights are computed from (11.5.2) with a look direction of $\phi_s = 0^\circ$. We can compute the
beampattern of the SMI adaptive beamformer using (11.2.3). The resulting beampatterns averaged
over 100 realizations for SMI adaptive beamformers with sample support of $K = 100$
and $K = 1000$ are shown in Figure 11.23 for $-10^\circ < \phi < 90^\circ$ along with the beampattern of
an optimum beamformer computed using the weight vector in (11.3.15) and a true correlation
matrix. Clearly, the sidelobe levels of the SMI adaptive beamformer are limited by the sample
support available for training. For the case of $K = 100$, the sidelobe level is approximately $-18$ dB,
whereas for $K = 1000$, the sidelobe level is approximately $-30$ dB.

FIGURE 11.23
Beampatterns of an SMI adaptive beamformer for (a) K = 100 snapshots and (b) K = 1000
snapshots. The dashed line is the quiescent response (optimum beamformer), and the solid line
is the SMI adaptive beamformer.


Training issues 


To implement the SMI adaptive beamformer, we need an estimate of the interference-plus-noise
correlation matrix, which of course requires that no desired signal $\mathbf{s}(n)$ be present.
The use of $\mathbf{R}_{i+n}$ provided an attractive theoretical basis for the derivation of the optimum
beamformer and its subsequent adaptive implementation with the SMI technique. Although
it can be shown that the use of a correlation matrix containing the desired signal produces
equivalent adaptive weights in the case of perfect steering, this can almost never be accomplished
in practice. Usually, we do not have perfect knowledge of the exact array sensor
locations and responses. Coupled with the fact that often the angle of the desired signal is not
known exactly for cases when we are searching for its actual direction, the presence of the desired
signal in the training set results in its cancelation and a subsequent loss in performance.

How do we get a signal-free estimate of the correlation matrix from array data in 
practice? In many applications, such as in certain radar and communications systems, we 
control when the desired signal is present since it is produced by a transmission that we 
initiate. In the case of jamming, common to both these applications, we can choose not to 
transmit for a period of time in order to collect data with which we can estimate $\mathbf{R}_{i+n}$. This
type of training is often termed listen-only. For other types of interference that are only 
present at the same time as the desired signal, such as clutter in radar and reverberations in 
active sonar, the training can be accomplished using a technique known as split window. If 
we use a training set consisting of data samples around the sample of interest (before and 
after), we can exclude the sample of interest, and possibly some of its neighboring samples, 
to avoid the inclusion of the desired signal in the training set. This method has significant 
computational implications because it requires a different correlation matrix and therefore 
a separate computation of the adaptive beamforming weights for each sample under con- 
sideration. This problem can be alleviated somewhat by using matrix update methods, as 
discussed in Chapter 10; nonetheless, the increase in cost cannot be considered insignificant. 

Certain methods have been proposed for the purposes of reducing the computations 
associated with estimating the correlation matrix. One such method is to assume the corre- 
lation matrix is Toeplitz for a ULA. Of course, this assumption is valid if the array consists 
of elements with equal responses $H_k(F, \phi)$ as a function of both frequency and angle from
(11.1.14). However, in practice, this assumption almost never holds. The fact that the spatial 
signals are measured using different sensors, all with different responses, coupled with the 
limits on mechanical precision of the sensor placement in the array inevitably will cause 
these assumptions to be violated. As a result, constraining the correlation matrix to be 
Toeplitz, which is akin to averaging the correlations down the diagonals of the correlation 
matrix, will cause performance degradation that can be significant. These methods are well 
suited for temporal signals that are measured with a common sensor and are sampled at a 
rate that is very accurately controlled via a single analog-to-digital converter. Unfortunately 
with arrays, the spatial sampling process is not nearly as precise, and the use of multiple 
sensors for measurements can produce vastly different signal characteristics. 


11.5.2 Diagonal Loading with the SMI Beamformer 


Clearly, the ability of an SMI adaptive beamformer to achieve a desired sidelobe level relies
on the availability of sufficient sample support $K$. However, for many practical applications,
owing to either the nonstationarity of the interference or operational considerations, a limited
number of samples are available to train the SMI adaptive beamformer. How, then, can
we achieve this desired low sidelobe behavior? First, recall that the beam response of an
optimum beamformer can be written in terms of its eigenvalues and eigenvectors as in
(11.3.23). Likewise, for the SMI adaptive beamformer

$$\hat{C}_{smi}(\phi) = \frac{\alpha}{\hat{\lambda}_{\min}} \left[ C_q(\phi) - \sum_{m=1}^{M} \frac{\hat{\lambda}_m - \hat{\lambda}_{\min}}{\hat{\lambda}_m}\, [\hat{\mathbf{q}}_m^H \mathbf{v}(\phi_s)]\, \hat{Q}_m(\phi) \right] \tag{11.5.23}$$

where $\alpha$ is the normalization constant of the weight vector as in (11.3.23), $\hat{\lambda}_m$ and
$\hat{\mathbf{q}}_m$ are the eigenvalues and eigenvectors of $\hat{\mathbf{R}}_{i+n}$, respectively, and $C_q(\phi)$
and $\hat{Q}_m(\phi)$ are the beampatterns of the quiescent weight vector and the $m$th eigenvector,
known as an eigenbeam, respectively. Therefore, $\hat{C}_{smi}(\phi)$ is simply $C_q(\phi)$ minus weighted
eigenbeams that place nulls in the directions of interferers. The weights on the eigenbeams
are determined by the ratio $(\hat{\lambda}_m - \hat{\lambda}_{\min})/\hat{\lambda}_m$. The noise eigenvectors are chosen to fill the
remainder of the interference-plus-noise space that is not occupied by the interference. Ideally,
these eigenvectors should have no effect on the beam response because the eigenvalues
of the true correlation matrix satisfy $\lambda_m = \lambda_{\min} = \sigma_w^2$ for $m > P$. However, this relation does
not hold for the sample correlation matrix, for which the eigenvalues vary about the noise
power $\sigma_w^2$ and asymptotically approach this expected value for increasing sample support.
Therefore, the eigenbeams affect the beam response in a manner determined by their deviation
from the noise floor $\sigma_w^2$. Since, as in the case of the sample correlation matrix, these
eigenvalues are random variables that vary according to the sample support $K$, the beam
response suffers from the addition of randomly weighted eigenbeams. The result is a higher
sidelobe level in the adaptive beampattern.

A means of reducing the variation of the eigenvalues is to add a weighted identity
matrix to the sample correlation matrix (Hudson 1981, Carlson 1988)

$$\hat{\mathbf{R}}_l = \hat{\mathbf{R}}_{i+n} + \sigma_l^2 \mathbf{I} \tag{11.5.24}$$

a technique that is known as diagonal loading. The result of diagonal loading of the correlation
matrix is to add the loading level to all the eigenvalues. This, in turn, produces a
bias in these eigenvalues in order to reduce their variation. To obtain the diagonally loaded
SMI adaptive beamformer, simply substitute $\hat{\mathbf{R}}_l$ into (11.5.2)

$$\mathbf{c}_{lsmi} = \frac{\hat{\mathbf{R}}_l^{-1} \mathbf{v}(\phi_s)}{\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_l^{-1} \mathbf{v}(\phi_s)} \tag{11.5.25}$$

The bias in the eigenvalues produces a slight bias in the adaptive weights that reduces the
output SINR. However, this reduction is very modest when compared to the substantial
gains in the quality of the adaptive beampattern.

Recommended loading levels are $\sigma_w^2 \le \sigma_l^2 < 10\sigma_w^2$. The maximum loading level is
dependent on the application, but the minimum should be at least equal to the noise power 
in order to achieve substantial improvements. The loading causes a reduction in the nulling 
of weak interferers, that is, interferers with powers that are relatively close to the noise 
power. The effect on strong interferers is minimal since their eigenvalues only experience 
a minor increase. One added benefit of diagonal loading is that it provides a robustness to 
signal mismatch, as described in Section 11.4.1. 
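
To make the effect on the eigenbeam weights concrete, the following Python sketch applies a loading level to a set of sample eigenvalues and recomputes the ratio $(\hat{\lambda}_m - \hat{\lambda}_{\min})/\hat{\lambda}_m$. The eigenvalues are hypothetical numbers chosen only for illustration (one strong interferer and noise eigenvalues scattered about a unit noise power), and the function name is ours.

```python
def eigenbeam_weights(eigenvalues, loading=0.0):
    """Weights (lam_m - lam_min) / lam_m after adding `loading` to every eigenvalue."""
    loaded = [lam + loading for lam in eigenvalues]
    lam_min = min(loaded)
    return [(lam - lam_min) / lam for lam in loaded]


# Hypothetical sample eigenvalues: one 70-dB interferer plus noise eigenvalues
# scattered around the unit noise power (illustrative numbers only).
lam_hat = [1.0e7, 2.5, 1.6, 1.1, 0.8, 0.5, 0.3]
unloaded = eigenbeam_weights(lam_hat)
loaded = eigenbeam_weights(lam_hat, loading=10 ** 0.5)  # 5 dB above the noise power

for lam, w0, w1 in zip(lam_hat, unloaded, loaded):
    print(f"{lam:12.1f}  {w0:6.3f}  {w1:6.3f}")
```

Because loading shifts every eigenvalue by the same amount, the differences $\hat{\lambda}_m - \hat{\lambda}_{\min}$ are unchanged while the denominators grow, so the weights on the noise eigenbeams shrink and the weight on the strong interferer stays essentially at one.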


EXAMPLE 11.5.3. In this example, we explore the use of diagonal loading of the sample correlation
matrix to control the sidelobe levels of the SMI adaptive beamformer using the same
set of parameters as in Example 11.5.2. The beampatterns for the SMI adaptive beamformer
and the diagonally loaded SMI adaptive beamformer are shown in Figure 11.24 along with
the beampattern for the optimum beamformer for $-10^\circ < \phi < 90^\circ$. The diagonal loading
level was set to 5 dB above the thermal noise power, that is, $\sigma_l^2 = 10^{0.5}$, and the sample support
was $K = 100$. The sidelobe levels of the diagonally loaded SMI adaptive beamformer
are very close to those of the optimum beamformer that used a known correlation matrix,
while for the SMI adaptive beamformer the sidelobes are at approximately $-18$ dB. To gain
some insight into the higher sidelobe levels of the SMI adaptive beamformer, we compute the
eigenvalues of the SMI adaptive beamformer without diagonal loading using the MATLAB command
lambda = eig(Rhat); where Rhat is the sample correlation matrix from (11.5.1). The
eigenvalues of the sample and true correlation matrix are shown in Figure 11.25. The largest
eigenvalue, corresponding to the 70-dB jammer, is approximately 70 dB but cannot be observed
on this plot. We notice that for $K = 100$ training samples, the noise eigenvalues of $\hat{\mathbf{R}}_{i+n}$ are
significantly different from those of $\mathbf{R}_{i+n}$, with larger than a 10-dB difference in some cases.
As we stated earlier, the effect on the beampattern of the SMI adaptive beamformer is to add
a random pattern weighted by this difference in eigenvalues. In the case of diagonal loading,
the eigenvalues have as a lower bound the loading level $\sigma_l^2$ which, in turn, reduces these errors
that are added to the beampatterns. The cost of the diagonal loading is to limit our ability to
cancel weak interference with power less than the loading level. However, in the case of strong
interference, almost no loss in terms of interference cancelation is experienced by introducing
diagonal loading.

FIGURE 11.24
Beampatterns of an SMI adaptive beamformer for K = 100 snapshots
without diagonal loading (dashed line), and with $\sigma_l^2 = 5$ dB
diagonal loading (solid line). The beampattern of the optimum
beamformer is also shown with the dash-dot line. (Vertical axis: power in dB.)


FIGURE 11.25
Noise eigenvalues of the SMI adaptive
beamformer without diagonal loading
(dashed line) and the optimum beamformer
(solid line), for $\sigma_w^2 = 1$.
11.5.3 Implementation of the SMI Beamformer


Although the SMI adaptive beamformer is formulated in terms of an estimated correlation 
matrix, the actual implementation, as with the least-squares methods discussed in Chapter 
8, is usually in terms of the data samples directly. In other words, the actual estimate of 
the correlation matrix is never formed explicitly. Methods that are implemented on the data 
directly are commonly referred to as amplitude domain techniques, whereas if the sample 
correlation matrix had been formed, the implementation would be said to be performed in the 
power domain. The explicit computation of the sample correlation matrix is undesirable, first 
and foremost because the squaring of the data requires a large increase in the dynamic range 
of any processor. Numerical errors in the data are squared as well, and for a large number of 
training samples this computation may be prohibitively expensive. In this section, we give 
a brief discussion of the implementation considerations for the SMI adaptive beamformer, 
where the implementation is strictly in the amplitude domain. The incorporation of diagonal 
loading in this setting is also discussed since its formulation was given in the power domain. 
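The SMI beamformer is based on the estimated correlation matrix from (11.5.1). This
sample correlation matrix may be written equivalently as

$$\hat{\mathbf{R}}_{i+n} = \frac{1}{K} \sum_{k=1}^{K} \mathbf{x}(n_k)\, \mathbf{x}^H(n_k) = \frac{1}{K} \mathbf{X}^H \mathbf{X} \tag{11.5.26}$$

where $\mathbf{X}$ is the data matrix formed with the array snapshots that make up the training set
for the SMI adaptive weights, presumably containing only interference and noise, that is,
no desired signals. This data matrix is

$$\mathbf{X}^H = [\mathbf{x}(n_1) \;\; \mathbf{x}(n_2) \;\; \cdots \;\; \mathbf{x}(n_K)] \tag{11.5.27}$$

$$= \begin{bmatrix} x_1(n_1) & x_1(n_2) & \cdots & x_1(n_K) \\ x_2(n_1) & x_2(n_2) & \cdots & x_2(n_K) \\ \vdots & \vdots & \ddots & \vdots \\ x_M(n_1) & x_M(n_2) & \cdots & x_M(n_K) \end{bmatrix} \tag{11.5.28}$$

where $n_k$, for $k = 1, 2, \ldots, K$, are the array snapshot indices of the training set. As was
shown in Chapter 8, we can perform a QR decomposition on the data matrix to obtain the
upper triangular factor

$$\mathbf{X} = \mathbf{Q} \mathbf{R}_x \tag{11.5.29}$$

where $\mathbf{Q}$ is a $K \times M$ orthonormal matrix and $\mathbf{R}_x$ is the $M \times M$ upper triangular factor. If
we define the lower triangular factor as

$$\mathbf{L}_x \triangleq \frac{1}{\sqrt{K}} \mathbf{R}_x^H \tag{11.5.30}$$

the sample correlation matrix can then be written as

$$\hat{\mathbf{R}}_{i+n} = \frac{1}{K} \mathbf{X}^H \mathbf{X} = \frac{1}{K} \mathbf{R}_x^H \mathbf{R}_x = \mathbf{L}_x \mathbf{L}_x^H \tag{11.5.31}$$

since $\mathbf{Q}^H \mathbf{Q} = \mathbf{I}$. The SMI adaptive weights from (11.5.2) are then found to be

$$\mathbf{c}_{smi} = \frac{\hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)}{\mathbf{v}^H(\phi_s)\, \hat{\mathbf{R}}_{i+n}^{-1} \mathbf{v}(\phi_s)} = \frac{\mathbf{L}_x^{-H} \mathbf{L}_x^{-1} \mathbf{v}(\phi_s)}{\|\mathbf{L}_x^{-1} \mathbf{v}(\phi_s)\|^2} \tag{11.5.32}$$
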

The implementation of diagonal loading with the SMI adaptive beamformer is also
possible in the amplitude domain. Recall that the diagonally loaded correlation matrix from
(11.5.24) is given by

$$\hat{\mathbf{R}}_l = \hat{\mathbf{R}}_{i+n} + \sigma_l^2 \mathbf{I} = \frac{1}{K} \mathbf{X}^H \mathbf{X} + \sigma_l^2 \mathbf{I} \triangleq \frac{1}{K} \mathbf{X}_l^H \mathbf{X}_l \tag{11.5.33}$$

where $\mathbf{X}_l$ is the "diagonally loaded" data matrix. Of course, data matrix $\mathbf{X}$ is not a square
matrix, and thus it is not actually diagonally loaded. Instead, we append the data matrix
with the square root of the loading matrix as

$$\mathbf{X}_l^H = [\mathbf{X}^H \;\; \sqrt{K}\,\sigma_l \mathbf{I}] \tag{11.5.34}$$


The resulting diagonally loaded SMI adaptive weights are found by substituting X; for X in 
the amplitude-domain implementation of the SMI adaptive beamformer given above. The 
practical implementation of the SMI adaptive beamformer is performed in the following 
steps: 


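1. Compute the QR factorization of the data matrix $\mathbf{X} = \mathbf{Q}\mathbf{R}_x$.
2. Find the Cholesky factor by normalizing the upper triangular factor: $\mathbf{L}_x = (1/\sqrt{K})\,\mathbf{R}_x^H$.
3. Solve for $\mathbf{z}_1$ from $\mathbf{L}_x\mathbf{z}_1 = \mathbf{v}(\phi_s)$.
4. Solve for $\mathbf{z}_2$ from $\mathbf{L}_x^H\mathbf{z}_2 = \mathbf{z}_1$.
5. The SMI adaptive weight vector is given by $\mathbf{c}_{\text{smi}} = \mathbf{z}_2/\|\mathbf{z}_1\|^2$.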
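The five steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function names `ula_steering` and `smi_weights_qr` are hypothetical, the half-wavelength ULA steering convention is an assumption, and a general linear solver stands in for the triangular solves that a practical implementation would use. Diagonal loading is included by appending $\sqrt{K}\sigma_l\mathbf{I}$ to the data matrix as in (11.5.34).

```python
import numpy as np

def ula_steering(M, phi_deg, d_over_lambda=0.5):
    """Unit-norm ULA steering vector (spacing and sign convention are assumptions)."""
    u = d_over_lambda * np.sin(np.deg2rad(phi_deg))
    return np.exp(-2j * np.pi * u * np.arange(M)) / np.sqrt(M)

def smi_weights_qr(X, v, sigma_l=0.0):
    """SMI weights computed in the amplitude domain.

    X       : K x M data matrix whose rows are x^H(n_k), so X^H X = sum x x^H
    v       : length-M steering vector
    sigma_l : loading level (amplitude); 0 disables diagonal loading
    """
    K, M = X.shape
    if sigma_l > 0.0:
        # Append sqrt(K)*sigma_l*I so that X_l^H X_l / K = R_hat + sigma_l^2 I.
        X = np.vstack([X, np.sqrt(K) * sigma_l * np.eye(M)])
    Rx = np.linalg.qr(X, mode="r")          # step 1: upper triangular factor
    Lx = Rx.conj().T / np.sqrt(K)           # step 2: L_x L_x^H = R_hat
    z1 = np.linalg.solve(Lx, v)             # step 3: L_x z1 = v
    z2 = np.linalg.solve(Lx.conj().T, z1)   # step 4: L_x^H z2 = z1
    return z2 / np.vdot(z1, z1).real        # step 5: c = z2 / ||z1||^2
```

Without loading, the routine needs at least $K \ge M$ snapshots so that the triangular factor is square and invertible; with loading, the appended rows guarantee this.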


11.5.4 Sample-by-Sample Adaptive Methods 


The SMI adaptive beamformer is a least-squares (LS) block adaptive technique similar 
to the LS methods discussed in Chapter 8. However, the optimum beamformer can also 
be implemented by using methods that compute the beamforming weights on a sample- 
by-sample basis; that is, the weights are updated for each new sample. Such methods 
are referred to as sample-by-sample adaptive and are simply extensions of the adaptive 
filtering methods from Chapter 10. The manner in which sample adaptive beamformers 
differ from adaptive filters is that rather than solve an unconstrained LS problem, adaptive 
beamformers solve a constrained LS problem. The implication of this constraint is that
rather than have an estimated cross-correlation in the normal equations $\hat{\mathbf{R}}(n)\mathbf{c} = \hat{\mathbf{d}}(n)$,
we have the deterministic steering vector $\mathbf{v}(\phi_s)$. Unlike the cross-correlation vector, the
steering vector is known a priori and is not estimated from the data. We briefly discuss
techniques based on both recursive least-squares (RLS) and steepest-descent methods. Since the
derivation of the methods follows that for the adaptive filters in Chapter 10 quite closely, 
we only give a brief sketch of the algorithms along with some discussion. 

An important consideration for sample adaptive methods is whether or not these tech- 
niques are appropriate for array processing applications. The problem with these methods 
is the amount of time required for the adaptive weights to converge. In many applications, 
the delay associated with the convergence of the adaptive beamformer is not acceptable. 
For example, a radar system might be attempting to find targets at close ranges. Range cor- 
responds to the time delay associated with the propagation of the radar signal. Therefore, 
close ranges are the first samples received by the array during a collection period. A sample 
adaptive method that uses the close ranges to train (converge) could not find the targets at 
these ranges (samples). In fact, the time needed for convergence may not be insignificant, 
thus creating a large blind interval that is often unacceptable. However, the sample-by- 
sample adaptive techniques are appropriate for array processing applications in which the 
operating environment is nonstationary. Since the sample-by-sample adaptive beamformer 
alters its weights with each new sample, it can dynamically update its response for such a 
changing scenario. 

Another important distinction between sample and block adaptive methods is the in- 
clusion of the signal of interest in each sample and thus in the correlation matrix. Therefore, 
for sample adaptive methods, we cannot use a signal-free version of the correlation matrix,
that is, the interference-plus-noise correlation matrix $\mathbf{R}_{i+n}$, but rather must use the whole
correlation matrix $\mathbf{R}_x$. The inclusion of the signal in the correlation matrix has profound
effects on the robustness of the adaptive beamformer in the case of signal mismatch. This 
effect was discussed in the context of the optimum beamformer in Section 11.4.1. 


Recursive least-squares methods


We will not spend a lot of time discussing recursive least-squares (RLS) methods 
for adaptive beamforming since this topic is treated in Chapter 10. For further details on 
RLS methods used in array processing, the interested reader is referred to Schreiber (1986),
McWhirter and Shepherd (1989), and Yang and Böhme (1992). An important difference between
the methods discussed here and those in Section 10.6 is in the normal equations that solve 
a constrained rather than an unconstrained optimization. The output signal of the adaptive 
beamformer is 


$$y(n) = \mathbf{c}^H\mathbf{x}(n) \tag{11.5.35}$$


However, $y(n)$ is not the desired response. In Section 10.6, we developed techniques based
on the normal equations $\hat{\mathbf{R}}\mathbf{c} = \hat{\mathbf{d}}$, where $\hat{\mathbf{d}}$ is the estimated cross-correlation. However,
for the adaptive beamformer we use the steering vector $\mathbf{v}(\phi_s)$, which is deterministic, in
place of $\hat{\mathbf{d}}$. Algorithms based on RLS methods can be implemented such that the output
$y(n)$ is computed directly (direct output extraction) or the adaptive beamformer weights are
computed and then applied to determine the output (see Section 10.6). The simplifications
for the beamformer case are discussed in Yang and Böhme (1992) and Haykin (1996).

The RLS methods are based on the update equation of the estimate of the correlation 
matrix 


$$\hat{\mathbf{R}}_x(n+1) = \lambda\hat{\mathbf{R}}_x(n) + \mathbf{x}(n+1)\mathbf{x}^H(n+1) \tag{11.5.36}$$


where $0 < \lambda < 1$ is a scalar sometimes referred to as the forgetting factor. From the
updated sample correlation matrix, an update for its inverse can be found by using the matrix
inversion lemma from Appendix A. The adaptive beamformer weight vector is then found
by evaluating the MVDR adaptive weights with the updated inverse sample
correlation matrix. In practice, these updates are implemented by slightly modifying any
of the algorithms described in Section 10.6.
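As a rough sketch of the idea, the update (11.5.36) can be propagated directly on the inverse correlation matrix with the matrix inversion lemma, after which the MVDR weights follow from the steering vector. The function name `rls_mvdr_step` and the choice of initial inverse are assumptions made for illustration; the numerically robust formulations of Section 10.6 are not reproduced here.

```python
import numpy as np

def rls_mvdr_step(P, x, v, lam=0.99):
    """One sample update of the inverse correlation matrix and MVDR weights.

    P   : current inverse of R_hat_x(n), assumed Hermitian (M x M)
    x   : new snapshot x(n+1), length M
    v   : steering vector v(phi_s), length M
    lam : forgetting factor
    """
    Px = P @ x
    g = Px / (lam + np.vdot(x, Px))               # gain vector
    # Matrix inversion lemma applied to lam*R + x x^H.
    P_new = (P - np.outer(g, Px.conj())) / lam
    Pv = P_new @ v
    c = Pv / np.vdot(v, Pv)                       # enforces c^H v = 1
    return P_new, c
```

A typical (assumed) initialization is `P = np.eye(M) / delta` for a small positive `delta`, which corresponds to starting the recursion from a lightly loaded correlation matrix.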


Steepest-descent methods 


The LMS algorithm from Section 10.4 is based on the method of steepest descent. 
However, the desired response used to form the LMS adaptive weights is not clear for the 
adaptive beamforming application. Instead, there is the steering vector $\mathbf{v}(\phi_s)$ that specifies
the direction to which the adaptive beamformer is steered, namely, the angle $\phi_s$. The
resulting constrained optimization produced the optimum MVDR beamformer from Section
11.3. The sample adaptive implementation of this constrained optimization problem based 
on steepest descent was first proposed by Frost. The resulting algorithm uses a projection 
operation to separate the constrained optimization into a data-independent component and 
an adaptive portion that performs an unconstrained optimization (Frost 1972). The original 
algorithm was formulated using multiple linear constraints, as will be discussed in Section 
11.6.1. However, in this section we focus on its implementation with the single unity-gain 
look-direction constraint for the MVDR beamformer. Note that the separation of the con- 
strained and unconstrained components proposed by Frost provided the motivation for the 
generalized sidelobe canceler (GSC) structure (Griffiths and Jim 1982) discussed in Sec- 
tion 11.3.5. Below, we simply give the procedure for implementing the Frost algorithm. 
The interested reader is referred to Frost (1972) for further details. 

Formally, the MVDR adaptive beamformer is attempting to solve the following con- 
strained optimization 


$$\min_{\mathbf{c}}\; \mathbf{c}^H\mathbf{R}_x\mathbf{c} \quad \text{subject to} \quad \mathbf{c}^H\mathbf{v}(\phi_s) = 1 \tag{11.5.37}$$

where the entire correlation matrix $\mathbf{R}_x$, including the desired signal, is used in place of the
interference-plus-noise correlation matrix $\mathbf{R}_{i+n}$ since we assume the signal of interest is
always present. The correlation matrix is unknown and must be estimated from the data.
To start the algorithm, we can form an $M \times M$ projection matrix $\mathbf{P}$ that projects onto a
subspace orthogonal to the data-independent steering vector $\mathbf{v}(\phi_s)$. This projection matrix
is given by (see Chapter 8)

$$\mathbf{P} = \mathbf{I} - \mathbf{v}(\phi_s)\mathbf{v}^H(\phi_s) \tag{11.5.38}$$

We can then define the nonadaptive beamformer weight vector as

$$\mathbf{c}_{\text{na}} = \mathbf{v}(\phi_s) \tag{11.5.39}$$


which is simply the spatial matched filter from Section 11.2. The update equation for the 
sample adaptive beamformer based on Frost’s steepest-descent (sd) algorithm is then written 
as 


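$$\mathbf{c}_{\text{sd}}(n+1) = \mathbf{c}_{\text{na}} + \mathbf{P}\left[\mathbf{c}_{\text{sd}}(n) - \mu\, y^*(n)\,\mathbf{x}(n)\right] \tag{11.5.40}$$

where $\mu$ is the step-size parameter and

$$y(n) = \mathbf{c}_{\text{sd}}^H(n)\,\mathbf{x}(n) \tag{11.5.41}$$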


is the output of the steepest-descent sample adaptive beamformer. 

Since the projection matrix $\mathbf{P}$ maintains orthogonality between $\mathbf{c}_{\text{na}}$ and the adapted
portion of (11.5.40), the nonadaptive beamformer weights from (11.5.39) maintain the unity-
gain constraint from (11.5.37). In fact, since the adaptation is performed on the component
orthogonal to $\mathbf{c}_{\text{na}}$ in an unconstrained manner, the Frost algorithm is essentially using
an LMS adaptive filter in the GSC architecture from Section 11.3.5. The convergence of
the Frost adaptive beamformer, as for the LMS adaptive filter, is controlled by the step-
size parameter $\mu$. In order for the adaptive beamformer weights to converge, the step-size
parameter must be chosen to satisfy

$$0 < \mu < \frac{1}{\tilde{\lambda}_{\max}} \tag{11.5.42}$$

where $\tilde{\lambda}_{\max}$ is the maximum eigenvalue of the matrix

$$\tilde{\mathbf{R}} = \mathbf{P}\mathbf{R}_x\mathbf{P} \tag{11.5.43}$$


More details about the algorithm can be found in Frost (1972). 

The sample adaptive beamformer based on the Frost algorithm maintains a look direction
of $\phi_s$ through the constraint $\mathbf{c}_{\text{sd}}^H\mathbf{v}(\phi_s) = 1$. This constraint is easily seen by interpreting
the adaptive weight update equation in (11.5.40) as the steering vector $\mathbf{c}_{\text{na}} = \mathbf{v}(\phi_s)$ updated
by a component orthogonal to $\mathbf{v}(\phi_s)$. In the case of a signal received from a direction $\phi_s$, the
adaptive beamformer will immediately track this signal, since the response at $\phi_s$ is fixed by
the constraint and is not part of the adaptation. The convergence of this sample adaptive
beamformer in terms of interference rejection is very similar to the LMS algorithm. See 
Chapter 10 for details on the LMS algorithm. 
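The update (11.5.40)-(11.5.41) translates almost line for line into code. The sketch below assumes a unit-norm steering vector, so that $\mathbf{P}$ in (11.5.38) is a true projection, and uses the hypothetical names `ula_steering` and `frost_beamformer`; the steering convention is likewise an assumption.

```python
import numpy as np

def ula_steering(M, phi_deg, d_over_lambda=0.5):
    """Unit-norm ULA steering vector (spacing and sign convention are assumptions)."""
    u = d_over_lambda * np.sin(np.deg2rad(phi_deg))
    return np.exp(-2j * np.pi * u * np.arange(M)) / np.sqrt(M)

def frost_beamformer(snapshots, v, mu):
    """Frost steepest-descent beamformer with a single look-direction constraint.

    snapshots : M x N array whose columns are x(n)
    v         : unit-norm steering vector v(phi_s)
    mu        : step size; should respect the bound in (11.5.42)
    """
    M, N = snapshots.shape
    P = np.eye(M) - np.outer(v, v.conj())     # projection orthogonal to v
    c_na = v.copy()                           # nonadaptive (matched filter) weights
    c = c_na.copy()
    y = np.empty(N, dtype=complex)
    for n in range(N):
        x = snapshots[:, n]
        y[n] = np.vdot(c, x)                  # y(n) = c^H(n) x(n)
        c = c_na + P @ (c - mu * np.conj(y[n]) * x)
    return c, y
```

Because every iteration rebuilds the weights as $\mathbf{c}_{\text{na}}$ plus a projected term, the constraint $\mathbf{c}^H\mathbf{v}(\phi_s) = 1$ holds at each step up to rounding, without accumulating error.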


11.6 OTHER ADAPTIVE ARRAY PROCESSING METHODS 


In this section, we consider various other adaptive array processing methods. First, we look 
at the use of multiple constraints in an adaptive array beyond the single constraint of
distortionless response for the MVDR optimum beamformer. Second, we consider partially
adaptive arrays, which perform deterministic preprocessing prior to adaptation
in order to reduce the adaptive degrees of freedom. These methods are commonly
used in practice both for computational reasons and because of limited sample support. Third, we
describe the sidelobe canceler, which was the first proposed adaptive array processing method.
In addition to its historical significance, the sidelobe canceler is still a viable technique for 
certain array processing applications. Throughout this section, we use the word adaptive to 
indicate that the various methods are based on training data. However, the derivations are all 
in terms of known statistics. Although none of these methods can really be called optimum,
each one satisfies an optimization criterion in the case of known statistics. The implementa- 
tion of the methods using actual data samples in place of assuming known statistics follows 
directly from the techniques described in Section 11.5. 


11.6.1 Linearly Constrained Minimum-Variance Beamformers 


In Section 11.3, we discussed the optimum beamformer that maximizes the signal-to- 
interference-plus-noise ratio (SINR). This optimum beamformer was also formulated as 
the solution to a constrained optimization problem, namely, 


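$$\min_{\mathbf{c}}\; \mathbf{c}^H\mathbf{R}_{i+n}\mathbf{c} \quad \text{subject to} \quad \mathbf{c}^H\mathbf{v}(\phi_s) = 1 \tag{11.6.1}$$


where $\mathbf{v}(\phi_s)$ is the array response vector for a signal arriving from an angle $\phi_s$. Due to this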
alternate formulation, the optimum beamformer is commonly referred to as the minimum- 
variance distortionless response (MVDR) beamformer. 

However, some applications may require additional conditions on the beamformer. 
As with the optimum beamformer, we want to minimize the output power $\mathbf{c}^H\mathbf{R}_{i+n}\mathbf{c}$, but
with additional constraints on the response of the beamformer. The imposition of further 
constraints on the minimum-variance beamformer results in suboptimum performance in 
terms of SINR. However, if designed properly, the constraints should have little effect on 
SINR while yielding some desirable attributes. One common use of constraints is for the 
case when the angle of an interference source $\phi_i$ is known a priori. In this case, we want to
reject all energy received from this angle, that is,

$$\mathbf{c}^H\mathbf{v}(\phi_i) = 0 \tag{11.6.2}$$

The result of the null constraint is an adaptive beamformer that rejects all energy from the
angle $\phi_i$. Another type of constraint is to require the beamformer to pass signals not only
from the angle $\phi_s$, but also from another angle $\phi_1$. As for the MVDR beamformer, this
constraint is formulated as

$$\mathbf{c}^H\mathbf{v}(\phi_1) = 1 \tag{11.6.3}$$


In this manner, multiple angles can be specified to pass signals of interest with unity gain. 
Such amplitude constraints can also be used to preserve the response of the beamformer in an
angular region about $\phi_s$ (Steele 1983, Takao et al. 1976). These additional constraints help
to make the resulting adaptive beamformer more robust to signal mismatches, as discussed
in Section 11.4.1, that result from the actual angle of the desired signal $\phi_0$ slightly differing
from its presumed angle $\phi_s$. Therefore, if we choose a pair of angles slightly offset from $\phi_s$

$$\phi_1 = \phi_s - \Delta\phi \qquad \phi_2 = \phi_s + \Delta\phi \tag{11.6.4}$$

the response of the beamformer steered to $\phi_s$ broadens. The effect in terms of mainlobe
width is similar to tapering the MVDR beamformer when the angle offset $\Delta\phi$ is small. An
alternative approach to robust adaptive beamforming is the use of derivative constraints. 
See Applebaum and Chapman (1976), Er and Cantoni (1983), and Steele (1983) for details. 

Once we have determined a set of constraints, for example, the desired responses at a set 
of angles, we can solve for the constrained adaptive beamformer. The result is known as the 
linearly constrained minimum-variance (LCMV) beamformer (Applebaum and Chapman 
1976; Buckley 1987). As we stated earlier, we want to minimize the output energy of the 
beamformer subject to a set of constraints. This problem is formulated as 


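$$\min_{\mathbf{c}}\; \mathbf{c}^H\mathbf{R}_{i+n}\mathbf{c} \quad \text{subject to} \quad \mathbf{C}^H\mathbf{c} = \boldsymbol{\delta} \tag{11.6.5}$$

where $\mathbf{C}$ is known as the constraint matrix and $\boldsymbol{\delta}$ is the constraint response vector. For
example, if we want to pass signals from an angle $\phi_s$ as well as preserve its response with
a pair of amplitude constraints at the angles $\phi_s \pm \Delta\phi$, the constraint matrix and constraint
response vector are given by

$$\mathbf{C} = [\mathbf{v}(\phi_s)\;\; \mathbf{v}(\phi_s - \Delta\phi)\;\; \mathbf{v}(\phi_s + \Delta\phi)] \qquad \boldsymbol{\delta} = [1\;\; 1\;\; 1]^T \tag{11.6.6}$$

As for the MVDR beamformer, the solution for the LCMV beamformer is found by using
Lagrange multipliers (see Appendix B). The LCMV beamformer weight vector is given by

$$\mathbf{c}_{\text{lcmv}} = \mathbf{R}_{i+n}^{-1}\mathbf{C}\left(\mathbf{C}^H\mathbf{R}_{i+n}^{-1}\mathbf{C}\right)^{-1}\boldsymbol{\delta} \tag{11.6.7}$$

As for the MVDR beamformer, the LCMV beamformer can also be formulated using a
generalized sidelobe canceler architecture (Griffiths and Jim 1982), discussed in Section
11.3.5. In fact, the MVDR beamformer is simply a special case of the LCMV beamformer
with $\mathbf{C} = \mathbf{v}(\phi_s)$ and $\boldsymbol{\delta} = 1$.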
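A short sketch of (11.6.7) is given below. The function name `lcmv_weights` is hypothetical, and the correlation matrix is taken as known, in keeping with this section; linear solves are used instead of explicit inverses.

```python
import numpy as np

def lcmv_weights(R, C, delta):
    """LCMV weights c = R^{-1} C (C^H R^{-1} C)^{-1} delta.

    R     : M x M interference-plus-noise correlation matrix
    C     : M x L constraint matrix (columns are constraint vectors)
    delta : length-L constraint response vector
    """
    Rinv_C = np.linalg.solve(R, C)                       # R^{-1} C
    gram = C.conj().T @ Rinv_C                           # C^H R^{-1} C
    return Rinv_C @ np.linalg.solve(gram, delta)
```

Passing a single-column `C` containing $\mathbf{v}(\phi_s)$ with `delta = np.ones(1)` reproduces the MVDR weights, which mirrors the special case noted in the text.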

In this section, we have described the use of linear constraints in a minimum-variance 
beamformer. However, the use of quadratic constraints within the context of a minimum- 
variance beamformer is also possible. The primary motivation for using these quadratic 
constraints is for robustness purposes against signal mismatch, as discussed in Section 
11.4.1. One such quadratic constraint adds a constraint on the norm of the weight vector 
of the adaptive beamformer in addition to the MVDR constraint (Cox et al. 1987; Maksym 
1979) 


$$\min_{\mathbf{c}}\; \mathbf{c}^H\mathbf{R}_{i+n}\mathbf{c} \quad \text{subject to} \quad \mathbf{c}^H\mathbf{v}(\phi_s) = 1 \quad \text{and} \quad \|\mathbf{c}\|^2 \le k^2 \tag{11.6.8}$$

whose solution is given by

$$\mathbf{c} = \gamma\left(\mathbf{R}_{i+n} + \sigma_l^2\mathbf{I}\right)^{-1}\mathbf{v}(\phi_s) \tag{11.6.9}$$

where $\gamma$ is a constant and $\sigma_l^2$ is a scaling term on the identity matrix. Thus, constraining the
norm of the adaptive beamforming weight vector is equivalent to adding a weighted
identity matrix to the interference-plus-noise correlation matrix. The solution to this quadratic
constraint bears a striking resemblance to diagonal loading as discussed in the context of
the SMI adaptive beamformer, in Section 11.5.2. In fact, the use of some level of diagonal 
loading is generally a recommended practice for implementing an adaptive beamformer to 
reduce its sensitivity to mismatch, and for the purposes of low sidelobe levels. 


11.6.2 Partially Adaptive Arrays 


The optimum beamformer maximizes output SINR by placing a null in the direction of any 
interference sources while maintaining gain in the direction of interest. Recall the optimum 
beamforming weights from (11.3.13) 


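$$\mathbf{c}_o = \alpha\,\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s) \tag{11.6.10}$$

where we choose the MVDR normalization $\alpha = [\mathbf{v}^H(\phi_s)\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)]^{-1}$. The correlation
matrix $\mathbf{R}_{i+n}$ is an $M \times M$ matrix where $M$ is the number of elements in the ULA. The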
optimum weights adapt to the statistics of the data in an M-dimensional space where M is 
referred to as the adaptive degrees of freedom. However, in many applications, the number 
of elements in the array exceeds the adaptive degrees of freedom that can be practically 
implemented. The implementation of such a beamformer requires the estimation of the 
correlation matrix from collected data. As shown in Section 11.5, the estimation of $\mathbf{R}_{i+n}$
requires a certain number of data samples to maintain a desired level of performance. Many 
times, the number of data samples is limited, due to either finite regions over which the data 
are stationary or restrictions on the length of the collection interval. Likewise, the number 
of adaptive degrees of freedom that can be implemented may be limited for computational 
reasons. These restrictions motivate the use of methods that reduce the degrees of freedom 
prior to adaptation. An array implemented using a reduced number of degrees of freedom
is referred to as a partially adaptive array. 

Consider an array signal vector x(n) consisting of a desired signal, interference, and 
noise components 


$$\mathbf{x}(n) = \mathbf{s}(n) + \mathbf{i}(n) + \mathbf{w}(n) = \sqrt{M}\,\mathbf{v}(\phi_s)s(n) + \sqrt{M}\sum_{p=1}^{P}\mathbf{v}(\phi_p)i_p(n) + \mathbf{w}(n) \tag{11.6.11}$$

where $\phi_p$ and $i_p(n)$ are the angle and signal, respectively, of the $p$th interferer with a total
of $P$ interferers. Usually, the number of interferers is limited; yet the number of elements
in the array $M$ may be quite large, that is, $M \gg P$. In general, one adaptive degree
of freedom is required for each interferer.† Therefore, we only require some number of
adaptive degrees of freedom $Q > P$, not the full dimensionality provided by the number
of elements $M$. We want to use a large number of elements in order to have an aperture
that achieves the desired angular resolution. Therefore, we do not want to limit the number
of elements in order to reduce the degrees of freedom; rather, we want to project the array
data into a lower-dimensional subspace in which we can perform our optimization (Morgan
1978). The projection is accomplished using a nonadaptive preprocessor and is modeled as
a rank-reducing transformation matrix $\mathbf{T}$ with dimensions $M \times Q$ applied to the array signal


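$$\tilde{\mathbf{x}}(n) = \mathbf{T}^H\mathbf{x}(n) \tag{11.6.12}$$

where $\tilde{\mathbf{x}}(n)$ is a signal vector of dimension $Q$. Likewise, the interference-plus-noise signal
in the lower-dimensional space is

$$\tilde{\mathbf{x}}_{i+n}(n) = \mathbf{T}^H\mathbf{x}_{i+n}(n) \tag{11.6.13}$$

and has a correlation matrix

$$\tilde{\mathbf{R}}_{i+n} = E\{\tilde{\mathbf{x}}_{i+n}(n)\tilde{\mathbf{x}}_{i+n}^H(n)\} = \mathbf{T}^H\mathbf{R}_{i+n}\mathbf{T} \tag{11.6.14}$$

The partially adaptive beamforming weights are then given by

$$\tilde{\mathbf{c}} = \alpha\,\tilde{\mathbf{R}}_{i+n}^{-1}\tilde{\mathbf{v}}(\phi_s) \tag{11.6.15}$$

where

$$\tilde{\mathbf{v}}(\phi_s) = \mathbf{T}^H\mathbf{v}(\phi_s) \tag{11.6.16}$$

is the projection of the $M$-dimensional steering vector $\mathbf{v}(\phi_s)$ onto the same $Q$-dimensional
subspace. The output of the partially adaptive beamformer is then obtained by applying the
beamforming weights in (11.6.15) to the reduced-dimension array signal from (11.6.12)

$$y(n) = \tilde{\mathbf{c}}^H\tilde{\mathbf{x}}(n) \tag{11.6.17}$$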


The resulting partially adaptive beamformer, shown in Figure 11.26, is no longer opti- 
mal in the sense of the full M-dimensional beamformer, but is optimal given the nonadaptive 
preprocessing transformation onto the Q-dimensional subspace. Thus, the overall perfor- 
mance of the partially adaptive beamformer is governed by how much information was 
preserved by the nonadaptive preprocessor T. The performance of the partially adaptive 
beamformer can be assessed relative to the full-dimensional processor by reconstructing 
the effective $M \times 1$ beamforming weight vector with the transformation matrix and the
partially adaptive (pa) weights

$$\mathbf{c}_{\text{pa}} = \mathbf{T}\tilde{\mathbf{c}} \tag{11.6.18}$$

In addition, we must consider the effect of this preprocessing transformation on the noise
correlation matrix. For the array signal, we have assumed that the noise has a power of $\sigma_w^2$
and is uncorrelated, that is, $\mathbf{R}_w = \sigma_w^2\mathbf{I}$. Therefore, the noise following the application of
the preprocessing transformation has a correlation matrix given by

$$\tilde{\mathbf{R}}_w = \mathbf{T}^H\mathbf{R}_w\mathbf{T} = \sigma_w^2\mathbf{T}^H\mathbf{T} \tag{11.6.19}$$

† The assumption is that the interferers are narrowband and are well separated in angle.


FIGURE 11.26
Partially adaptive array using data transformation.


In the case of an SMI adaptive beamformer, this different structure of the noise correlation 
matrix has implications for diagonal loading. The diagonal loading of the sample correla- 
tion matrix of the full array was performed by adding a weighted diagonal matrix to the 
sample correlation matrix in (11.5.24). For a partially adaptive array that already has had 
a preprocessing transform performed, the diagonal loading of a sample correlation matrix 
becomes 


$$\hat{\tilde{\mathbf{R}}}_l = \hat{\tilde{\mathbf{R}}}_{i+n} + \sigma_l^2\mathbf{T}^H\mathbf{T} \tag{11.6.20}$$

where $\sigma_l^2$ is the loading level. Since the thermal noise is not necessarily uncorrelated after
the preprocessing transformation, diagonal loading must account for the transformed noise 
correlation. Otherwise, performance degradation can occur. 
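The chain (11.6.12)-(11.6.20) can be sketched for a given transformation $\mathbf{T}$ as follows. The function name `partially_adaptive_weights` is hypothetical, the correlation matrix passed in may be either the known or a sample matrix, and the normalization constant is chosen here, as an assumption, to give unity gain in the reduced-dimension space.

```python
import numpy as np

def partially_adaptive_weights(R, T, v, sigma_l2=0.0):
    """Reduced-dimension beamformer weights for a nonadaptive preprocessor T.

    R        : M x M interference-plus-noise correlation matrix
    T        : M x Q rank-reducing transformation
    v        : length-M steering vector v(phi_s)
    sigma_l2 : loading level; the loading term is sigma_l2 * T^H T, not the identity
    Returns (c_tilde, c_pa): the Q-dimensional weights and the effective
    M-dimensional weights c_pa = T c_tilde.
    """
    TH = T.conj().T
    R_t = TH @ R @ T + sigma_l2 * (TH @ T)    # reduced matrix with transformed loading
    v_t = TH @ v                              # projected steering vector
    c_t = np.linalg.solve(R_t, v_t)
    c_t = c_t / np.vdot(v_t, c_t)             # unity gain: c_t^H v_t = 1 (assumed normalization)
    return c_t, T @ c_t
```

With `T = np.eye(M)` and no loading, the result reduces to the full-dimensional MVDR weights, which gives a convenient check on any particular choice of preprocessor.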

So far, we have only stated that the adaptation for a partially adaptive array must take 
place in a lower-dimensional space using a nonadaptive preprocessor, but we have not 
given any explicit means of performing this task. Below we discuss two commonly used 
preprocessing methods used for partially adaptive arrays. 


Subarray partially adaptive arrays 


Many times, the number of elements in an array can be very large. Thus, one means 
of reducing the adaptive degrees of freedom is to split the array into a number of smaller 
arrays, process the smaller arrays in a nonadaptive manner, and perform adaptation on the 
outputs of these smaller arrays. Let us consider the case in which we are looking for signals 
from a direction $\phi_s$, and the full-dimensional steering vector is $\mathbf{v}_M(\phi_s)$, where we use the
subscript $M$ to denote the length of the steering vector. The full array may be divided into
$Q$ equal-sized intervals† of nonoverlapping subarrays of length

$$\tilde{M} = \frac{M}{Q} \tag{11.6.21}$$

where we have assumed that $M$ is an integer multiple of $Q$. The rank-reducing
transformation for the subarrays then can be written as a sparsely populated matrix made up of
length-$\tilde{M}$ steering vectors $\mathbf{v}_{\tilde{M}}(\phi_s)$


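$$\mathbf{T} = \begin{bmatrix} \mathbf{v}_{\tilde{M}}(\phi_s) & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{v}_{\tilde{M}}(\phi_s) & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{v}_{\tilde{M}}(\phi_s) \end{bmatrix} \tag{11.6.22}$$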


Each subarray consists of an $\tilde{M}$-dimensional conventional beamformer steered to $\phi_s$ and
can be viewed as a highly directional element as opposed to the omnidirectional elements
assumed for the individual sensors of the array.

† Subarrays need not necessarily have equal length or be nonoverlapping. This restriction is placed on the
formulation only to simplify the discussion.


Beamspace partially adaptive arrays


Another approach to constructing a partially adaptive beamformer is to produce a set
of beams using the full array. The ensuing adaptation is performed in a reduced-dimension
beamspace, that is, a space spanned by the nonadaptive beams. If we use $B$ beams, the
rank-reducing transformation matrix is

$$\mathbf{T} = [\mathbf{v}(\phi_1)\;\; \mathbf{v}(\phi_2)\;\; \cdots\;\; \mathbf{v}(\phi_B)] \tag{11.6.23}$$

where $\phi_1, \phi_2, \ldots, \phi_B$ are the angles of these beams. These beamformers are typically
steered in directions around the angle of interest, $\phi_s$. For example, if the angle of interest is
$\phi_s = 0°$, beams might be steered to angles $\phi = -5°, -4°, \ldots, 0°, \ldots, 4°, 5°$. The spacing
of the beams depends on the full aperture of the array and the angular extent of interest. 
One can also steer beams in other directions away from the angle of interest, which may 
contain interference sources that we will want to cancel in the partially adaptive processor. 

We have modeled the rank-reducing transformation as a matrix. Usually, the rank of the 
reduced-dimension space is dictated by the number of digital channels that can be formed 
due to hardware limitations. Therefore, the rank reduction process is performed prior to 
sampling using analog beamformers, either across a reduced or full array aperture for the 
subarray or beamspace partially adaptive array processors, respectively. 
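For illustration, the two preprocessors (11.6.22) and (11.6.23) can be built digitally as shown below, even though, as noted above, the rank reduction is often performed with analog beamformers. The function names and the half-wavelength steering convention are assumptions made for this sketch.

```python
import numpy as np

def ula_steering(M, phi_deg, d_over_lambda=0.5):
    """Unit-norm ULA steering vector (spacing and sign convention are assumptions)."""
    u = d_over_lambda * np.sin(np.deg2rad(phi_deg))
    return np.exp(-2j * np.pi * u * np.arange(M)) / np.sqrt(M)

def subarray_transform(M, Q, phi_s_deg):
    """M x Q block-diagonal T of (11.6.22): Q nonoverlapping subarrays of length M/Q."""
    if M % Q != 0:
        raise ValueError("M must be an integer multiple of Q")
    M_sub = M // Q
    v_sub = ula_steering(M_sub, phi_s_deg)
    T = np.zeros((M, Q), dtype=complex)
    for q in range(Q):
        T[q * M_sub:(q + 1) * M_sub, q] = v_sub   # one steered subarray per column
    return T

def beamspace_transform(M, beam_angles_deg):
    """M x B matrix T of (11.6.23): one full-aperture beam per column."""
    return np.column_stack([ula_steering(M, phi) for phi in beam_angles_deg])
```

With unit-norm steering vectors the subarray columns are orthonormal, so $\mathbf{T}^H\mathbf{T} = \mathbf{I}$ and the transformed noise in (11.6.19) stays white; closely spaced beamspace columns generally overlap, which is where the $\mathbf{T}^H\mathbf{T}$ loading term of (11.6.20) matters.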


11.6.3 Sidelobe Cancelers 


The sidelobe canceler is actually one of the first implementations of an adaptive array 
(Howells 1959), and it was originally proposed by Howells and Applebaum. The method 
uses a main channel along with a single auxiliary, or an array of auxiliary channels, as shown 
in Figure 11.27. The main channel generally has a high gain in the direction of the desired 
signal and is produced by either a highly directional sensor, for example, a parabolic dish, 
or the output of a nonadaptive beamformer, such as a spatial matched filter. The auxiliary 
channels, however, are low-gain elements often with omnidirectional responses that are 
used to augment the main channel. The auxiliary channels can be in a ULA configuration. 
The idea behind the sidelobe canceler is that interference is assumed to be present in both 
main and auxiliary channels, but the desired signal, though present in the main channel 
due to its high gain in the direction of the signal, is below the sensor thermal noise in the 
auxiliary channels. The auxiliary channels are used to form an estimate of the main channel 
interference that can be used for cancelation purposes. The philosophy behind the sidelobe 
canceler is shown in Figure 11.28, using representative beampatterns weighted by their 
directional gains. 
FIGURE 11.27
Sidelobe canceler.

FIGURE 11.28
Illustration of the sidelobe canceler channel and auxiliary channel beampatterns. Panels: main channel response, auxiliary channel net response, and sidelobe canceler output response, each with the angles $\phi_0$ and $\phi_i$ marked.

Consider a main channel (mc) signal

$$x_{\text{mc}}(n) = g_s s(n) + i_{\text{mc}}(n) + w_{\text{mc}}(n) \tag{11.6.24}$$

consisting of the desired signal $s(n)$ with a gain of $g_s$, an interference signal $i_{\text{mc}}(n)$ that
may be due to several interferers arriving from various angles, and noise $w_{\text{mc}}(n)$ that is
temporally uncorrelated. All three of these signals are assumed to be mutually uncorrelated.
The interference in this main channel is often so strong that it dominates the desired signal
even though the main channel has a large gain in the direction of this desired signal. However, the auxiliary
channel signals may be written as a signal vector


$$\mathbf{x}_a(n) = s(n)\mathbf{v}(\phi_s) + \sum_{p=1}^{P} i_p(n)\mathbf{v}(\phi_p) + \mathbf{w}(n) \tag{11.6.25}$$

where $\mathbf{v}(\phi)$ is the array response vector at an angle $\phi$ that was given by (11.1.19) for the
case of a ULA. The desired signal impinges on the auxiliary array from the angle $\phi_s$,
and the sensor thermal noise $\mathbf{w}(n)$ is temporally and spatially uncorrelated. Recall that
$s(n)$ is usually considered weak enough that it is well below the sensor noise power. The
interference can be made up of several sources. Here we have chosen a model consisting
of $P$ interferers with signals $i_p(n)$ and angles of arrival of $\phi_p$. Note that the main channel
interference $i_{\text{mc}}(n)$ is made up of contributions from the same $P$ interferers weighted by
the spatial response of the main channel in the directions of the interference sources. These 
angles of arrival of the interferers, as well as the exact response of the main channel in these 
directions, are generally unknown and lead to an adaptive solution for the auxiliary channel 
weight vector. 

The sidelobe canceler estimates the interference in the main channel by using the
auxiliary channels. As illustrated in Figure 11.27, the auxiliary channels are combined by 
using a set of adaptive weights to form an estimate of the interference in the main channel. 


$$\hat{i}_{\text{mc}}(n) = \mathbf{c}_a^H\mathbf{x}_a(n) \tag{11.6.26}$$

where the adaptive weight vector $\mathbf{c}_a$ is chosen so as to minimize the output power. Of course,
the implicit assumption has been made that the signal of interest is below the thermal noise
level in $\mathbf{x}_a(n)$. Otherwise, if $s(n)$ is strong enough in the auxiliary channels, then the sidelobe
canceler will cancel this signal of interest in addition to the interference. The output signal
is then obtained by subtracting the estimate of the interference from the main channel

$$y(n) = x_{\text{mc}}(n) - \hat{i}_{\text{mc}}(n) \tag{11.6.27}$$

The output power is given by 
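$$P_{\text{out}} = \sigma_{\text{mc}}^2 - E\{|\mathbf{c}_a^H\mathbf{x}_a(n)|^2\} = \sigma_{\text{mc}}^2 - \mathbf{c}_a^H\mathbf{R}_a\mathbf{c}_a \tag{11.6.28}$$

where

$$\mathbf{R}_a = E\{\mathbf{x}_a(n)\mathbf{x}_a^H(n)\} \tag{11.6.29}$$

is the auxiliary array correlation matrix. The solution for these weights is simply the linear
MMSE estimator from Section 6.2, given by

$$\mathbf{c}_a = \mathbf{R}_a^{-1}\mathbf{r}_{\text{ma}} \tag{11.6.30}$$

where

$$\mathbf{r}_{\text{ma}} = E\{\mathbf{x}_a(n)x_{\text{mc}}^*(n)\} \tag{11.6.31}$$

is the cross-correlation vector between the auxiliary array and the main channel. The output
signal of the sidelobe canceler is

$$y(n) = x_{\text{mc}}(n) - \mathbf{c}_a^H\mathbf{x}_a(n) \tag{11.6.32}$$

Hence, the minimum output power is obtained by substituting (11.6.30) into (11.6.28)

$$P_{\text{out}}^{(\min)} = \sigma_{\text{mc}}^2 - \mathbf{r}_{\text{ma}}^H\mathbf{R}_a^{-1}\mathbf{r}_{\text{ma}} \tag{11.6.33}$$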


Of course, all this analysis has considered the case in which the signal of interest is below 
the thermal noise level in the auxiliary array. Larger signal amplitudes will result in the 
cancelation of the signal of interest using a sidelobe canceler structure. This topic is treated 
in Problem 11.15. 
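A sample-based sketch of the sidelobe canceler is shown below, with the expectations in (11.6.29) and (11.6.31) replaced by time averages over the available snapshots. That substitution, and the function name `sidelobe_canceler`, are assumptions made for illustration.

```python
import numpy as np

def sidelobe_canceler(x_mc, X_a):
    """Sidelobe canceler using sample estimates of R_a and r_ma.

    x_mc : length-N main channel samples
    X_a  : M_a x N auxiliary channel snapshots (columns are x_a(n))
    Returns (c_a, y): auxiliary weights and the canceler output.
    """
    N = x_mc.size
    R_a = X_a @ X_a.conj().T / N            # sample version of E{x_a x_a^H}
    r_ma = X_a @ x_mc.conj() / N            # sample version of E{x_a x_mc^*}
    c_a = np.linalg.solve(R_a, r_ma)        # c_a = R_a^{-1} r_ma
    y = x_mc - c_a.conj() @ X_a             # y(n) = x_mc(n) - c_a^H x_a(n)
    return c_a, y
```

When the same samples are used both to estimate the weights and to measure the output, the average output power equals the sample main-channel power minus $\hat{\mathbf{r}}_{\text{ma}}^H\hat{\mathbf{R}}_a^{-1}\hat{\mathbf{r}}_{\text{ma}}$, which is the sample counterpart of (11.6.33).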


11.7 ANGLE ESTIMATION 


In this section, we consider the topic of angle estimation, that is, given a spatially propagating
signal $s(n)$, the determination of its angle of arrival at the array. In the formulation of
the beamformers in Sections 11.2 through 11.6, the assumption was always made that
the beamformer was steered to the angle of the desired signal. However, in practice, the
actual angle from which the signal arrives is not precisely known. Instead, an amount of
uncertainty exists with respect to the exact angle, even when the signal of interest is within
the beam. The beamformer is steered to angle $\phi_0$ while the actual signal arrives from $\phi_s$.
The purpose of an angle estimation algorithm is to attempt to determine this angle $\phi_s$. We
begin with a discussion of the maximum-likelihood (ML) angle estimator. Next we give a 
brief sketch of the Cramér-Rao lower bound on angle accuracy, which provides a measure 
against which the performance of any algorithm can be compared. Then we consider angle 
estimation algorithms, commonly referred to as beamsplitting. In the case of a ULA, a 
spatially propagating signal is equivalent to a complex exponential in the temporal domain. 
Hence, we briefly discuss the use of the frequency estimation techniques from Section 9.6 
that were based on the model of a complex exponential contained in noise. 


11.7.1 Maximum-Likelihood Angle Estimation 


In this section, we give a brief discussion of the maximum-likelihood estimator of the angle 
of a signal arriving at a ULA. Consider a spatially propagating signal of interest 


s = /Mo,v(¢,) (11.7.1) 


where M is the number of sensors in the ULA, o's is the complex amplitude of the signal, and 
@, is the angle of the signal. The complex signal amplitude has a deterministic magnitude 
and uniformly distributed random phase. The signal is received by the ULA along with 
interference i and spatially uncorrelated thermal noise w, that is, 


x = J/Mosv(¢,) i+ w= VMosv(o,) + Xinn (11.7.2) 


We have dropped the discrete-time index $n$ since we are assuming the signal is present¹ and
we are interested in a single snapshot only. The interference-plus-noise correlation matrix
of the snapshot $\mathbf{x}$ is given by


$$\mathbf{R}_{i+n} = E\{\mathbf{x}_{i+n}\mathbf{x}_{i+n}^H\} = \mathbf{R}_i + \sigma_w^2\,\mathbf{I} \qquad (11.7.3)$$


Furthermore, we assume that the interference-plus-noise signal $\mathbf{x}_{i+n}$ has a complex Gaussian
density function with zero mean. Thus, the probability density function of the snapshot $\mathbf{x}$
is a complex Gaussian function with a mean determined by the signal of interest


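$$p(\mathbf{x};\,\sigma_s,\phi_s) = \frac{1}{\pi^M \det(\mathbf{R}_{i+n})}\,\exp\left\{-[\mathbf{x} - \sqrt{M}\,\sigma_s\mathbf{v}(\phi_s)]^H\,\mathbf{R}_{i+n}^{-1}\,[\mathbf{x} - \sqrt{M}\,\sigma_s\mathbf{v}(\phi_s)]\right\} \qquad (11.7.4)$$


The peak in this probability density function corresponds to the mean given by the signal of
interest $\sqrt{M}\,\sigma_s\mathbf{v}(\phi_s)$, which is the “most likely” event. The ML angle estimate is the angle
$\phi_s$ for which this probability density function of the snapshot takes on its maximum value,
that is,


$$\hat{\phi}_s = \arg\max_{\phi_s}\; p(\mathbf{x};\,\sigma_s,\phi_s) \qquad (11.7.5)$$
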
The resulting ML estimator of $\phi_s$ is then given by (Kay 1993)


$$\hat{\phi}_s = \arg\max_{\phi}\; \frac{|\mathbf{v}^H(\phi)\,\mathbf{R}_{i+n}^{-1}\,\mathbf{x}|^2}{\mathbf{v}^H(\phi)\,\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi)} \qquad (11.7.6)$$


Interestingly, this ML estimate can be interpreted as


$$\hat{\phi}_s = \arg\max_{\phi}\; |\mathbf{c}_{amf}^H(\phi)\,\mathbf{x}|^2 \qquad (11.7.7)$$


where $\mathbf{c}_{amf}(\phi)$ is the optimum beamformer given by (11.3.13) with adaptive matched filter
(AMF) normalization from Table 11.1


$$\mathbf{c}_{amf}(\phi) = \frac{\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi)}{\sqrt{\mathbf{v}^H(\phi)\,\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi)}} \qquad (11.7.8)$$


as shown in Robey et al. (1992). This normalization is in contrast to the MVDR normalized
optimum beamformer in (11.3.15) that we have considered for the remainder of this chapter.
Therefore, the ML angle estimator is the angle to which an AMF normalized optimum
beamformer is steered that maximizes the output power for a given snapshot $\mathbf{x}$. In terms
of the angle accuracy that might be achieved, the ML estimator can be approximated by
forming a dense grid of optimum beamformers in angle with angular spacing at the desired
minimum acceptable accuracy (Baranoski and Ward 1997). In many applications, we might
want to achieve a much finer resolution than the beamwidth of the ULA, say, one-tenth
of a beamwidth accuracy, known as 10:1 beamsplitting. Thus, this level of angle accuracy
would require the computation of roughly $10M$ AMF optimum beamformers, where $M$ is
the number of sensors in the ULA. Generally, this requirement is computationally excessive,
and we desire an alternative angle estimation algorithm that can achieve performance
comparable to the ML estimator. This topic is addressed in Section 11.7.3. However, let us
first consider the performance of the ML angle estimator that can be used as a bound for
other angle estimation algorithms, which is the topic of the next section.

¹In many applications, this assumption may be based on an up-front processing stage that determines the presence
of the signal, known as detection.
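
To make the grid-search approximation of the ML estimator concrete, the following is a minimal sketch in plain Python of evaluating the cost in (11.7.6) on a grid of candidate angles. The function names (`steering`, `inner`, `solve`, `ml_angle`), the half-wavelength spacing, the steering-vector sign convention (taken from (11.8.1)), and the scenario at the bottom are all assumptions made for illustration; they are not taken from the text. A noise-free snapshot is used so that the result is repeatable.

```python
import cmath
import math


def steering(M, phi_deg, d_over_lambda=0.5):
    """Unit-norm ULA steering vector v(phi); sign convention assumed as in (11.8.1)."""
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    return [cmath.exp(-2j * math.pi * u * m) / math.sqrt(M) for m in range(M)]


def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[r]) + [b[r]] for r in range(n)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for r in range(k + 1, n):
            f = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= f * aug[k][c]
    y = [0j] * n
    for k in reversed(range(n)):
        tail = sum(aug[k][c] * y[c] for c in range(k + 1, n))
        y[k] = (aug[k][n] - tail) / aug[k][k]
    return y


def ml_angle(x, R, grid_deg, d_over_lambda=0.5):
    """Grid search of (11.7.6): argmax |v^H R^-1 x|^2 / (v^H R^-1 v)."""
    M = len(x)
    Rinv_x = solve(R, x)
    best_phi, best_val = None, -1.0
    for phi in grid_deg:
        v = steering(M, phi, d_over_lambda)
        val = abs(inner(v, Rinv_x)) ** 2 / inner(v, solve(R, v)).real
        if val > best_val:
            best_phi, best_val = phi, val
    return best_phi


# Hypothetical scenario: M = 8 sensors, one 20-dB interferer at 30 degrees,
# unit-variance noise, and a noise-free snapshot of a signal at 3.2 degrees.
M = 8
v_i = steering(M, 30.0)
R = [[(1.0 if r == c else 0.0) + 100.0 * M * v_i[r] * v_i[c].conjugate()
      for c in range(M)] for r in range(M)]
x = [math.sqrt(M) * (0.6 + 0.8j) * e for e in steering(M, 3.2)]
grid = [k / 10 for k in range(-100, 101)]   # 0.1-degree grid spacing
print(ml_angle(x, R, grid))
```

Each grid point costs one linear solve in this sketch, which illustrates why a dense grid of adaptive beamformers quickly becomes expensive as the required accuracy grows.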


11.7.2 Cramér-Rao Lower Bound on Angle Accuracy 


The Cramér-Rao bound (CRB) places a lower bound on the performance of an unbiased 
estimator (Kay 1993). We provide a sketch of the derivation of the CRB for angle accuracy 
(Ward 1996). This derivation is a simplification of the derivation by Ward (1996) that was 
done for two-dimensional angle and frequency estimation. Note that the CRB provides the 
minimum variance of an unbiased estimator. If an estimator can achieve the CRB, then it is 
the maximum-likelihood estimator. The CRB is found by solving for the diagonal elements 
of the inverse of the Fisher information matrix. For more details see Kay (1993) and Ward 
(1996). 

Let us start by redefining the beamformer for a ULA from the spatial matched filter in
(11.1.19) that has its phase center moved from the first element to the center of the array


$$\mathbf{v}_\Sigma(\phi) = e^{j2\pi\frac{M-1}{2}\frac{d}{\lambda}\sin\phi}\,\mathbf{v}(\phi) = \frac{1}{\sqrt{M}}\left[e^{j2\pi\frac{M-1}{2}\frac{d}{\lambda}\sin\phi}\;\; e^{j2\pi\frac{M-3}{2}\frac{d}{\lambda}\sin\phi}\;\;\cdots\;\; e^{-j2\pi\frac{M-1}{2}\frac{d}{\lambda}\sin\phi}\right]^T \qquad (11.7.9)$$


which we will refer to as the sum beamformer.² This choice of a phase center provides the
tightest bound on accuracy (Rife and Boorstyn 1974). We can define a second beamformer
based on the derivative of $\mathbf{v}_\Sigma(\phi)$ given by


$$\mathbf{v}_\Delta(\phi) = j\,\boldsymbol{\delta} \odot \mathbf{v}_\Sigma(\phi) \qquad (11.7.10)$$


where $\odot$ denotes the element-by-element (Hadamard) product and


$$\boldsymbol{\delta} = \left[-\frac{M-1}{2}\;\; -\frac{M-3}{2}\;\;\cdots\;\; \frac{M-1}{2}\right]^T \qquad (11.7.11)$$


which can be thought of as a difference taper. The steering vector $\mathbf{v}_\Delta(\phi)$, however, provides
a difference pattern beamformer steered to the angle $\phi$, as is commonly used in monopulse
radar (Levanon 1988) for angle estimation purposes. For this reason, we refer to it as the
difference beamformer. In relation to the sum beamformer, we can easily verify that


$$\mathbf{v}_\Sigma^H(\phi)\,\mathbf{v}_\Delta(\phi) = 0 \qquad (11.7.12)$$


that is, the two beamformers are orthogonal to each other. The fact that the two beamformers
are orthogonal to each other means that, in terms of the signal $\mathbf{s}$, the two beamformers can
make two independent measurements of the signal. These independent measurements allow
for the discrimination of the angle.

²We use the term beamformer for interpretation of the Cramér-Rao bound only. No actual beams are formed since
the CRB is only a performance bound and not a processing technique.

Using these two steering vectors $\mathbf{v}_\Sigma(\phi)$ and $\mathbf{v}_\Delta(\phi)$, we can form an adaptive sum
beamformer


$$\mathbf{c}_\Sigma(\phi) = \mathbf{R}_{i+n}^{-1}\,\mathbf{v}_\Sigma(\phi) \qquad (11.7.13)$$


and an adaptive difference beamformer


$$\mathbf{c}_\Delta(\phi) = \mathbf{R}_{i+n}^{-1}\,\mathbf{v}_\Delta(\phi) \qquad (11.7.14)$$


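neither of which has been normalized to satisfy any particular criterion. Proceeding, we can
compute the power of the interference-plus-noise output of these two beamformers


$$P_\Sigma = \mathbf{c}_\Sigma^H\,\mathbf{R}_{i+n}\,\mathbf{c}_\Sigma \qquad\quad P_\Delta = \mathbf{c}_\Delta^H\,\mathbf{R}_{i+n}\,\mathbf{c}_\Delta \qquad (11.7.15)$$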


Similarly, we can measure the normalized cross-correlation $\rho_{\Sigma\Delta}$ of the interference-plus-noise
outputs of these adaptive sum and difference beamformers


$$\rho_{\Sigma\Delta}^2 = \frac{|\mathbf{c}_\Sigma^H\,\mathbf{R}_{i+n}\,\mathbf{c}_\Delta|^2}{P_\Sigma\,P_\Delta} \qquad (11.7.16)$$


Using (11.7.15) and (11.7.16), the CRB on angle estimation for a ULA is given by³


$$\sigma_\phi^2 \ge \frac{1}{2\pi^2 \cdot \mathrm{SNR}_0 \cdot P_\Delta\,(1 - \rho_{\Sigma\Delta}^2)\,\cos^2\phi} \qquad (11.7.17)$$


where $\mathrm{SNR}_0$ is the SNR for a spatial matched filter from (11.2.16) in the absence of interference,
that is, noise only, which is given by


$$\mathrm{SNR}_0 = M\,\frac{\sigma_s^2}{\sigma_w^2} \qquad (11.7.18)$$

ow 
The CRB on angle accuracy has several interesting interpretations. First and foremost, as
the signal power increases in value, $\mathrm{SNR}_0$ increases; as a result, angle accuracy improves.
Intuitively, this result makes sense as the stronger the signal of interest, the better the angle
estimate should be. Likewise, the term $\cos^2\phi$ simply represents the increase in beamwidth
of the ULA as we steer away from broadside ($\phi = 0°$). The interpretation of the other
terms $P_\Delta$ and $1 - \rho_{\Sigma\Delta}^2$ may be less obvious, but also provides insight. Here $P_\Delta$ provides
a measure of the received power aligned with the adaptive difference beamformer. On the
other hand, $\rho_{\Sigma\Delta}$ is the cross-correlated energy between the adaptive sum and difference
beamformers. Ideally, $\rho_{\Sigma\Delta}$ is zero, since the $\mathbf{c}_\Sigma$ and $\mathbf{c}_\Delta$ beamformers are derived from $\mathbf{v}_\Sigma$
and $\mathbf{v}_\Delta$, respectively, which are orthogonal to each other. In the case of the two adaptive
beamformers, the adaptation will remove this orthogonality, but the beamformers should
be different enough that $\rho_{\Sigma\Delta} < 1$. Otherwise, angle accuracy will suffer.
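
As a sanity check on these interpretations, the sketch below evaluates the right-hand side of (11.7.17) for the special case of no interference and unit-variance noise, so that the correlation matrix is the identity and the adaptive beamformers reduce to the sum and difference steering vectors. This special case, the function names, and the half-wavelength spacing are assumptions chosen for illustration only.

```python
import cmath
import math


def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def sum_beamformer(M, phi_deg, d_over_lambda=0.5):
    """Steering vector with its phase center at the middle of the array, as in (11.7.9)."""
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    return [cmath.exp(2j * math.pi * ((M - 1) / 2 - m) * u) / math.sqrt(M)
            for m in range(M)]


def difference_beamformer(M, phi_deg, d_over_lambda=0.5):
    """v_delta = j * (delta element-wise v_sum), as in (11.7.10) and (11.7.11)."""
    delta = [m - (M - 1) / 2 for m in range(M)]
    v_sum = sum_beamformer(M, phi_deg, d_over_lambda)
    return [1j * dm * vs for dm, vs in zip(delta, v_sum)]


def crb_noise_only(M, snr0, phi_deg, d_over_lambda=0.5):
    """Right-hand side of (11.7.17), assuming R_{i+n} = I (no interference)."""
    c_sum = sum_beamformer(M, phi_deg, d_over_lambda)          # R^-1 v = v when R = I
    c_diff = difference_beamformer(M, phi_deg, d_over_lambda)
    p_sum = inner(c_sum, c_sum).real                           # (11.7.15) with R = I
    p_diff = inner(c_diff, c_diff).real
    rho2 = abs(inner(c_sum, c_diff)) ** 2 / (p_sum * p_diff)   # (11.7.16)
    cos2 = math.cos(math.radians(phi_deg)) ** 2
    return 1.0 / (2 * math.pi ** 2 * snr0 * p_diff * (1 - rho2) * cos2)


# Bound at broadside and at 60 degrees for a 10-element array with SNR_0 = 100 (20 dB).
print(crb_noise_only(10, 100.0, 0.0), crb_noise_only(10, 100.0, 60.0))
```

In this interference-free case the cross-correlation vanishes by (11.7.12) and $P_\Delta = (M^2 - 1)/12$, so the only angle dependence left in the bound is the $\cos^2\phi$ beam-broadening term.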


11.7.3 Beamsplitting Algorithms 


Let us consider the scenario with a single beamformer steered to an angle $\phi_0$ with our
signal of interest at angle $\phi_s$. The beamformer passes all signals within its beamwidth with
only slight attenuation of signals that are not directly at the center of the beam steered to
$\phi_0$. Clearly, this single beamformer cannot discriminate between signals received within
its beamwidth. However, we desire a more accurate estimate of the angle of the signal of
interest than simply the beamwidth of our beamformer. Thus, any angle estimator must
achieve finer accuracy than the beamwidth, and as a result angle estimation algorithms are
commonly referred to as beamsplitting algorithms.

³This formulation assumes unit-variance thermal noise power. Therefore, if signals have different thermal noise
power, the correlation matrix must be normalized by the thermal noise power prior to computing $P_\Delta$ and $\rho_{\Sigma\Delta}$.

To construct an angle estimation algorithm, it is necessary to obtain different measurements
of the signal of interest in order to determine its angle. These measurements allow
an angle estimation algorithm to discriminate between returns that arrive at the array from
different angles. To this end, we use a set of beamformers steered in the general direction of
the signal of interest but with different spatial responses, that is, beampatterns. One means
of obtaining different measurements of the signal of interest is to slightly offset the steering
direction of two beamformers. For example, we might form two beams at angles


$$\phi_1 = \phi_0 - \Delta\phi \qquad\quad \phi_2 = \phi_0 + \Delta\phi \qquad (11.7.19)$$


where $\Delta\phi$ is a fraction of a beamwidth, for example, half a beamwidth. Let the weight
vectors for these two beamformers be $\mathbf{c}_1$ and $\mathbf{c}_2$, respectively. These two beamformers can
be either nonadaptive, as in the case of the conventional beamformers discussed in Section
11.2, or one of the various adaptive beamformers from Section 11.3, 11.5, or 11.6. Ideally, a
pair of adaptive beamformers is used for applications in which interference is encountered.
Since the two beamformers are slightly offset from angle $\phi_0$, they may be thought of as
“left” and “right” beamformers. Using the beamformer weight vectors, we can then form


the ratio


$$\gamma_x = \frac{\mathbf{c}_1^H\,\mathbf{x}}{\mathbf{c}_2^H\,\mathbf{x}} \qquad (11.7.20)$$


where recall that $\mathbf{x}$ is the snapshot under consideration that contains the signal of interest
$\mathbf{s} = \sqrt{M}\,\sigma_s\mathbf{v}(\phi_s)$. Similarly, we can also hypothesize this ratio for any angle $\phi$ to form a
discrimination function


$$\gamma(\phi) = \frac{\mathbf{c}_1^H\,\mathbf{v}(\phi)}{\mathbf{c}_2^H\,\mathbf{v}(\phi)} \qquad (11.7.21)$$


Comparing the value of the measured ratio in (11.7.20) for the snapshot $\mathbf{x}$ to this angular
discrimination function in (11.7.21), we obtain an estimate of the angle of the signal of
interest $\phi_s$. The key requirement for the discrimination function is that it be monotonic
over the angular region in which it is used; that is, there is a one-to-one correspondence
between the value of the function in (11.7.21) and every angle in this region. The angular
region typically encompasses the beamwidths of the two beamformers. This requirement on
the discrimination function $\gamma(\phi)$ means that the two beamformers $\mathbf{c}_1$ and $\mathbf{c}_2$ must have
different spatial responses.

We have simply given an example of how an angle estimation algorithm might be
constructed. The topic of angle estimation is a very large area, and the choice of algorithm
should be determined by the particular application. In the example given, we constructed
a beamsplitting algorithm with left and right beams. Similarly, we could have chosen sum
and difference beams, as is commonly done in radar in a technique known as monopulse
(Sherman 1984). In fact, sum and difference beams can be formed from left and right beams
by taking their sum and difference, respectively. In this case, a simple linear transformation
exists that provides a mapping between the two beam strategies, and as a result one would
anticipate equivalent performance. For further material on angle estimation algorithms, the
interested reader is referred to Davis et al. (1974), McGarty (1974), Zoltowski (1992), and
Nickel (1993).
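
The left/right beamsplitting procedure of (11.7.19) through (11.7.21) can be sketched as follows. This is only an illustration under stated assumptions: the two beams are conventional (nonadaptive) spatial matched filters, the discrimination function is inverted by a simple table lookup over a grid restricted to the region between the two beams, and the snapshot is noise-free. The function names and the numerical scenario are hypothetical.

```python
import cmath
import math


def steering(M, phi_deg, d_over_lambda=0.5):
    """Unit-norm ULA steering vector v(phi); sign convention assumed as in (11.8.1)."""
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    return [cmath.exp(-2j * math.pi * u * m) / math.sqrt(M) for m in range(M)]


def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def beam_ratio(c1, c2, y):
    """Ratio of the left-beam output to the right-beam output for the vector y."""
    return inner(c1, y) / inner(c2, y)


def beamsplit_estimate(x, phi0_deg, dphi_deg, grid_deg, d_over_lambda=0.5):
    """Match the measured ratio (11.7.20) against the discrimination function (11.7.21)."""
    M = len(x)
    c1 = steering(M, phi0_deg - dphi_deg, d_over_lambda)   # "left" beam
    c2 = steering(M, phi0_deg + dphi_deg, d_over_lambda)   # "right" beam
    gamma_x = beam_ratio(c1, c2, x)
    return min(grid_deg,
               key=lambda phi: abs(gamma_x - beam_ratio(c1, c2, steering(M, phi, d_over_lambda))))


# Hypothetical scenario: M = 10 sensors, beams at -5 and +5 degrees around phi0 = 0,
# and a noise-free snapshot of a signal at 1.3 degrees with arbitrary complex amplitude.
M = 10
x = [(2 - 1j) * math.sqrt(M) * e for e in steering(M, 1.3)]
grid = [k / 10 for k in range(-50, 51)]   # lookup table between the two beam centers
print(beamsplit_estimate(x, 0.0, 5.0, grid))
```

Only two beamformer outputs are computed from the data; the grid is used solely to tabulate the discrimination function, which can be done once in advance for nonadaptive beams. The lookup region is kept between the two beam centers so that the function stays monotonic and the right-beam response has no null inside it.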


11.7.4 Model-Based Methods 


In Section 9.6, we discussed frequency estimation techniques based on a model of a complex 
exponential contained in noise. Certainly all these techniques could also be applied to 
the angle estimation problem, particularly for a ULA that has a direct correspondence 
to a discrete-time uniformly sampled signal. In this case, the angle is determined by the 
spatial frequency of the ULA. These methods are commonly referred to as superresolution 
techniques because they are able to achieve better resolution than traditional, nonadaptive 
methods. In fact, many of these techniques were originally proposed for array processing 
applications. However, certain considerations must be taken into account when one is trying
to apply these methods for use with a sensor array. First, a certain amount of uncertainty
exists with respect to the exact spatial location of all the sensors. All these methods exploit
the structure imposed by regular sampling where knowledge of the sampling instants is
very precise. In the case of a temporally sampled signal, this assumption is very reasonable;
but in the case of an array with these uncertainties, the validity of this assumption must be 
called into question. In addition, for a sensor array, all the signals are measured by different 
sensors with slightly different characteristics, as opposed to a temporally sampled signal 
for which all the samples are measured by the same sensor (analog-to-digital converter). 
Although these channel mismatches can be corrected for in theory, a perfect correction is 
never possible. For this reason, caution is in order when using these model-based methods 
for the purposes of angle estimation with an array. 
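
The correspondence between spatial frequency and angle for a ULA can be illustrated with a deliberately simple estimator. The sketch below is not one of the methods of Section 9.6; it only estimates the phase progression between adjacent sensors of a single noise-free complex exponential and maps the resulting spatial frequency $u = (d/\lambda)\sin\phi$ back to an angle. The function names, the sign convention, and the half-wavelength spacing are assumptions.

```python
import cmath
import math


def steering(M, phi_deg, d_over_lambda=0.5):
    """Unit-norm ULA steering vector v(phi); sign convention assumed as in (11.8.1)."""
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    return [cmath.exp(-2j * math.pi * u * m) / math.sqrt(M) for m in range(M)]


def angle_from_phase_progression(x, d_over_lambda=0.5):
    """Estimate the angle of a single complex exponential across the array."""
    # Average the lag-one products; their phase is -2*pi*u for the convention above.
    acc = sum(x[m + 1] * x[m].conjugate() for m in range(len(x) - 1))
    u = -cmath.phase(acc) / (2 * math.pi)           # spatial frequency (d/lambda) sin(phi)
    s = max(-1.0, min(1.0, u / d_over_lambda))      # guard the arcsine domain
    return math.degrees(math.asin(s))


# Noise-free snapshot from 20 degrees on a hypothetical 16-element array.
x = [(1 + 2j) * e for e in steering(16, 20.0)]
print(angle_from_phase_progression(x))
```

The estimate relies entirely on the sensors being exactly uniformly spaced and identical, which is precisely the assumption that sensor position uncertainty and channel mismatch call into question.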


11.8 SPACE-TIME ADAPTIVE PROCESSING 


Space-time adaptive processing (STAP) is concerned with the two-dimensional processing 
of signals in both the spatial and temporal domains. The topic of STAP has received a 
lot of attention recently as it is a natural extension of array processing (Ward 1994, 1995; 
Klemm 1999). Although discussions of STAP date back to the early 1970s (Brennan and 
Reed 1973), the realization of STAP in an actual system was not possible until just recently, 
due to advances that were necessary in computing technology. We give a brief overview of 
the principles of STAP and cite some of the considerations for its practical implementation. 
Although STAP has also been proposed for use in communications systems (Paulraj and 
Papadias 1997), we primarily discuss it in the context of the airborne radar application for 
the purposes of clutter cancelation (Brennan and Reed 1973; Ward 1995; Klemm 1999). 
A general STAP architecture is shown in Figure 11.29. Consider a ULA of sensors as
we have discussed throughout this chapter. We choose a ULA for the sake of simplicity, but
note that STAP techniques can be extended for arbitrary array configurations. In addition,
the signal from each sensor consists of a set of time samples or delays that make up a
time window.

FIGURE 11.29
Space-time adaptive processing. [Figure axis: Delays (time)]

In radar applications, the time samples represent the returns from a set of
transmitted pulses. For an airborne radar that moves with a certain velocity, the reflected
signals from moving and nonmoving objects have a single frequency across the pulses.
The pulse frequency results in a complex exponential across the pulses. The frequency is
produced by the relative velocity of the objects with respect to the array and is known as
the Doppler frequency. Thus, we wish to construct a space-time model for a signal received
from a certain angle $\phi_s$ at a frequency $f_s$. We model the spatial component of the signal
using the spatial (sp) steering vector for a ULA with $M$ sensors from (11.1.23)


$$\mathbf{v}_{sp}(\phi) = \frac{1}{\sqrt{M}}\left[1\;\; e^{-j2\pi(d/\lambda)\sin\phi}\;\;\cdots\;\; e^{-j2\pi(d/\lambda)(M-1)\sin\phi}\right]^T \qquad (11.8.1)$$


Likewise, the temporal component of a signal that is a complex exponential can be modeled 
using a data window frequency vector, which technically is a temporal frequency steering 
vector. This temporal steering vector is given by 


$$\mathbf{v}_{time}(f) = \frac{1}{\sqrt{L}}\left[1\;\; e^{-j2\pi f}\;\;\cdots\;\; e^{-j2\pi f(L-1)}\right]^T \qquad (11.8.2)$$


where $L$ is the number of time samples or pulses. Both $\mathbf{v}_{sp}$ and $\mathbf{v}_{time}$ have unit norm, that is,
$\mathbf{v}_{sp}^H\mathbf{v}_{sp} = 1$ and $\mathbf{v}_{time}^H\mathbf{v}_{time} = 1$. Using these two one-dimensional steering vectors, we can
form the two-dimensional $LM \times 1$ steering vector known as a space-time steering vector


$$\mathbf{v}_{st}(\phi, f) = \mathbf{v}_{time}(f) \otimes \mathbf{v}_{sp}(\phi) \qquad (11.8.3)$$


where $\otimes$ is the Kronecker product (Golub and Van Loan 1996). This vector, like the two
one-dimensional steering vectors, has unit norm. Using this space-time steering vector, we
can then model a spatially propagating signal arriving at the ULA from an angle $\phi_s$ with a
frequency $f_s$ as


$$\mathbf{s}(n) = \sqrt{LM}\,\mathbf{v}_{st}(\phi_s, f_s)\,s(n) \qquad (11.8.4)$$


Of course, this signal of interest $\mathbf{s}(n)$ is not the only signal since the ULA at the very least
will have thermal noise from the sensors. However, let us consider the case where in addition
to the signal of interest, the ULA receives other spatially propagating signals that constitute
interference $\mathbf{i}(n)$. Thus, the overall space-time signal in the ULA is


$$\mathbf{x}(n) = \mathbf{s}(n) + \mathbf{i}(n) + \mathbf{w}(n) \qquad (11.8.5)$$


where $\mathbf{w}(n)$ is the sensor thermal noise space-time signal that is both temporally and spatially
uncorrelated, that is, $E\{\mathbf{w}(n)\mathbf{w}^H(n)\} = \sigma_w^2\,\mathbf{I}$. The interference component is made up of
spatially propagating signals that may be temporally uncorrelated or consist of complex
exponentials in the time domain, just as the signal of interest. In the case of an airborne
radar, jamming interference is temporally uncorrelated while spatially correlated; that is, the
jamming signal consists of uncorrelated noise that arrives from a certain angle $\phi$. However,
clutter returns are produced by reflections of the radar signal from the ground and have 
both spatial and temporal correlation. Due to the nature of the airborne radar problem, 
these clutter returns exhibit a certain structure that can be exploited for the purposes of 
implementing a STAP algorithm (Ward 1995; Klemm 1999). 
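
The construction of the space-time steering vector in (11.8.3) can be sketched in a few lines of plain Python. The function names are hypothetical, the spacing is assumed to be half a wavelength, and the sign of the temporal exponent is assumed to match the spatial convention of (11.8.1).

```python
import cmath
import math


def spatial_steering(M, phi_deg, d_over_lambda=0.5):
    """Unit-norm spatial steering vector, as in (11.8.1)."""
    u = d_over_lambda * math.sin(math.radians(phi_deg))
    return [cmath.exp(-2j * math.pi * u * m) / math.sqrt(M) for m in range(M)]


def temporal_steering(L, f):
    """Unit-norm temporal steering vector for normalized frequency f (sign assumed)."""
    return [cmath.exp(-2j * math.pi * f * k) / math.sqrt(L) for k in range(L)]


def kron(a, b):
    """Kronecker product of two vectors; element (k, m) lands at index k*len(b) + m."""
    return [ai * bj for ai in a for bj in b]


def space_time_steering(L, M, phi_deg, f, d_over_lambda=0.5):
    """LM x 1 space-time steering vector of (11.8.3)."""
    return kron(temporal_steering(L, f), spatial_steering(M, phi_deg, d_over_lambda))


# Example: L = 4 pulses, M = 6 sensors, angle 15 degrees, normalized Doppler 0.2.
v_st = space_time_steering(4, 6, 15.0, 0.2)
norm_sq = sum(abs(e) ** 2 for e in v_st)
print(len(v_st), norm_sq)
```

With this ordering, each consecutive block of $M$ entries holds the spatial snapshot for one pulse, scaled by that pulse's Doppler phase.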

As we did for the optimum beamformer, we want to find the optimum STAP weight 
vector. The optimality condition is again the maximization of the output SINR. The space- 
time interference-plus-noise correlation matrix is 


$$\mathbf{R}_{i+n} = E\{\mathbf{x}_{i+n}(n)\,\mathbf{x}_{i+n}^H(n)\} \qquad (11.8.6)$$

where

$$\mathbf{x}_{i+n}(n) = \mathbf{i}(n) + \mathbf{w}(n) \qquad (11.8.7)$$

is the interference-plus-noise component of the signal. The availability of data that do not
contain the signal of interest is a training issue for the implementation of the STAP algorithm
that we do not consider here. See Borsari and Steinhardt (1995) and Rabideau and Steinhardt
(1999).


The optimum STAP weight vector is found in a similar fashion to the optimum beamformer
in Section 11.3. Using a unit gain on target constraint, the optimum STAP weight
vector is


$$\mathbf{c}_{stap} = \frac{\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi_s, f_s)}{\mathbf{v}^H(\phi_s, f_s)\,\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi_s, f_s)} \qquad (11.8.8)$$


where the space-time steering vector $\mathbf{v}(\phi_s, f_s)$ specifies the angle and frequency of the
presumed signal of interest $\mathbf{s}(n)$. The implementation of STAP requires the estimation of
$\mathbf{R}_{i+n}$ from data samples in order to compute the sample correlation matrix $\hat{\mathbf{R}}_{i+n}$. This
block adaptive implementation is also known as sample matrix inversion (SMI). SMI was
discussed in the context of adaptive beamforming in Section 11.5.

The adaptive degrees of freedom of full STAP as specified in (11.8.8) are $LM$. For most
applications, computational considerations as well as a limited amount of data to train the
adaptive weights make the implementation of fully dimensional STAP impractical. Thus,
we must consider reduced-dimension versions of STAP (Ward 1994, 1995). To this end,
a preprocessing stage precedes the adaptation that reduces the degrees of freedom to an
acceptable level. The most commonly considered approaches use a partially adaptive array
implementation, as discussed in Section 11.6.2, either beamspace or subarrays, to reduce
the spatial degrees of freedom. Temporal degrees of freedom can be reduced by using a
frequency-selective temporal or Doppler filter (Ward and Steinhardt 1994). Alternatively,
a subset of the total number of pulses can be used where the subsets of pulses are then
combined following adaptive processing (Baranoski 1995).

A brief mention should be given to the application of STAP to the communications
problem (Paulraj and Papadias 1997). Unlike in the radar application, it is generally not
possible to separate the signal of interest from the interference-plus-noise in communications
applications. In addition, although as in radar the signals are spatially propagating and
thus arriving at the sensor array at a specific angle, for communications the temporal signals
are not necessarily complex exponential signals. Instead, many times the signals consist
of coded sequences. In this case, STAP must incorporate the codes rather than complex
exponentials into its processing for the proper extraction of these signals.
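
Returning to the radar case, the optimum STAP weight vector of (11.8.8) can be illustrated with a small numerical sketch. The scenario is an assumption made purely for illustration: a known correlation matrix consisting of unit-variance noise plus a single jammer that is temporally uncorrelated and spatially correlated, with $L = 3$ pulses and $M = 4$ sensors. The function names and the temporal sign convention are likewise hypothetical, and a small Gaussian elimination routine stands in for a library solver.

```python
import cmath
import math


def steering(n, freq):
    """Unit-norm vector [1, e^{-j 2 pi freq}, ...] / sqrt(n) (sign convention assumed)."""
    return [cmath.exp(-2j * math.pi * freq * k) / math.sqrt(n) for k in range(n)]


def kron(a, b):
    """Kronecker product of two vectors."""
    return [ai * bj for ai in a for bj in b]


def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[r]) + [b[r]] for r in range(n)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for r in range(k + 1, n):
            f = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= f * aug[k][c]
    y = [0j] * n
    for k in reversed(range(n)):
        tail = sum(aug[k][c] * y[c] for c in range(k + 1, n))
        y[k] = (aug[k][n] - tail) / aug[k][k]
    return y


def stap_weights(R, v):
    """Unit-gain-on-target weights of (11.8.8): R^-1 v / (v^H R^-1 v)."""
    w = solve(R, v)
    gain = inner(v, w)
    return [wi / gain for wi in w]


L, M = 3, 4
v = kron(steering(L, 0.25), steering(M, 0.0))      # target: broadside, Doppler 0.25
a = steering(M, 0.5 * math.sin(math.radians(40.0)))  # jammer spatial vector at 40 degrees
P = 1000.0                                          # jammer power per element (30 dB)
N = L * M
# Jammer is white in time, so its correlation matrix is block diagonal across pulses.
R = [[(1.0 if r == c else 0.0)
      + (P * M * a[r % M] * a[c % M].conjugate() if r // M == c // M else 0.0)
      for c in range(N)] for r in range(N)]
c_stap = stap_weights(R, v)
print(abs(inner(c_stap, v)))
```

In an SMI implementation the matrix `R` would be replaced by a sample correlation matrix estimated from training data, which is where the $LM$ degrees of freedom become costly.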


11.9 SUMMARY 


In this chapter, we have given a brief overview of array processing, starting with array 
fundamentals and covering optimum and adaptive beamforming. Throughout the chapter, 
we focused on the ULA, but it is possible to extend these concepts to other array config- 
urations. The background material included spatially propagating signals, the concepts of 
modulation and demodulation, and the model for a spatial signal received by a ULA. We 
drew the analogy between the ULA, in terms of spatial sampling, and the discrete-time 
sampling of temporal signals. Next, we introduced the topic of conventional beamforming 
for which we discussed the spatial matched filter, which maximizes the SNR in the absence 
of interference, and tapered low sidelobe beamformers. Within this context, we looked at 
the characteristics of an array in terms of resolution and ambiguities known as grating lobes. 

The remainder of the chapter dealt with optimum and adaptive beamforming techniques 
related to the processing introduced earlier in the book for use with discrete-time signals. 
These methods are concerned with adapting to the characteristics of the data, either assuming 
knowledge of these characteristics (optimum) or estimating them from the data (adaptive). 
One might say that the fundamental equation in adaptive beamforming is $\mathbf{c} = \mathbf{R}^{-1}\mathbf{v}(\phi_s)$,
where $\mathbf{v}(\phi_s)$ determines the direction $\phi_s$ in which we are steering and $\mathbf{R}^{-1}$, the inverse of
the array correlation matrix, performs the adaptation to the array data. Within this context,
we looked at various issues, such as sidelobe levels and interference cancelation, and the
effects of signal mismatch and bandwidth. The more advanced topics of angle estimation
and STAP were also discussed.

Throughout this chapter, we have tried to remain general in our treatment of the array 
processing principles. Ultimately, specific issues and concerns related to the application will 
dictate the type of processing that is needed. Areas in which arrays are commonly employed 
include radar, sonar, and communications. In parts of the chapter, we have used examples 
based on radar applications since they tend to be the easiest to simplify and describe. Other 
important issues not discussed that arise in radar include the nonstationarity of the signals as 
well as training strategies. The sonar application is rich with issues that make the implemen- 
tation of adaptive arrays a very challenging task. Propagation effects, including signal mul- 
tipath, lead to complicated models that must be used to estimate the steering vectors. In addi- 
tion, signals of interest tend to be present at all times, so that the adaptive beamformer must 
be trained with the signal of interest present. As we described in Section 11.4.1, this situation 
leads to a heightened sensitivity to signal mismatch. For more details see Baggeroer et al. 
(1993). Arrays for communications applications have also become a very popular field ow- 
ing to the rapid growth of the wireless communications industry. The fundamental issue for 
wireless communications is the number of users that can be simultaneously accommodated. 
The limitations arise from the interference produced by other users. Arrays can help to in- 
crease the capacity in terms of the number of users. For more details, see Litva and Lo (1996). 

We have presented some material on the more advanced topics of angle estimation and 
STAP. Another extension of adaptive beamforming is the subject of adaptive detection. This 
topic is concerned with the determination of the presence of signals of interest in which the 
detector is determined adaptively from the data. References on this subject include Kelly 
(1986), Steinhardt (1992), Robey et al. (1992), Bose and Steinhardt (1995), Scharf and 
McWhorter (1997), Kreithen and Steinhardt (1996), Conte et al. (1996), and Richmond 
(1997). 


PROBLEMS 


11.1 Consider a narrowband spatially propagating signal with a speed of propagation c. The signal 
impinges on an M = 2 element ULA from an angle $\phi = 0°$ with a spacing d between the
elements. For illustration purposes, let the temporal content of the signal be a pulse. 


(a) Let the time of arrival of the pulse at the first sensor be t = 0. At what time does the signal 
arrive at the second sensor? 

(b) Do any other angles $\phi$ produce the same delay between the two sensors? Why?

(c) Suppose now that we only have a single sensor. Can we determine the angle from which 
a signal impinges on this sensor? 


11.2 We want to investigate the use of a mechanically steered versus an electronically steered array.
Consider a spatial matched filter with M = 20 $\lambda/2$-spaced elements. Now consider that the
array is steered to $\phi = 45°$. In the case of mechanical steering, the pointing direction is
always broadside to the array. To compute the beampattern of the mechanically steered array,
simply take the beampattern computed at $\phi = 0°$ and shift it by the mechanical steering
angle, that is, $\phi' = \phi + \phi_{mech}$. However, the beampattern of an electronically steered array is
simply the beampattern of the spatial matched filter steered to the desired angle. Compare the
beampattern of the mechanically steered array to that of the electronically steered array. What
do you observe? Repeat this for $\phi = 60°$ steering, both electronic and mechanical.


11.3 In this problem, we want to explore the use of beampatterns and steered or spatial responses of
a ULA. Consider a signal x(n) consisting of two spatially propagating signals from $\phi_1 = -10°$
and $\phi_2 = 30°$, both made of random, complex Gaussian noise. The respective powers of the
two signals are 20 and 25 dB. The number of sensors in the ULA is 50, and its thermal noise
level is 0 dB. The ULA has interelement spacing of $d = \lambda/2$.


(a) Compute one realization of x(n) for N = 1000 samples, that is, $1 \le n \le N$. Using a
spatial matched filter, compute a steered response for this signal from the beamformer
outputs, and plot it in decibels versus angle. What do you observe? Compare the result to
the expected steered response using the true correlation matrix.

(b) Compute and plot the beampattern for the spatial matched filter steered to $\phi = 30°$. How
can you relate the power levels you observed in part (a) at the angles of the two signals to
the beampattern?

(c) Change the power level of the signal at $\phi = -10°$ to 60 dB, and compute the steered
response. What do you observe? What do you recommend in order to distinguish these two
signals? Implement your idea and plot the estimated steered response from the beamformer
outputs.


11.4 Suppose that we have an M = 30 element ULA with a thermal noise level of 0 dB.


(a) Generate a realization of the ULA signal x(n) consisting of two random, complex Gaussian
signals at $\phi = 0°$ and $\phi = 3°$ both with power 20 dB, along with the sensor thermal
noise. The interelement spacing is $d = \lambda/2$. Let the number of samples you generate be
N = 1000. Compute and plot the steered response of x(n) using a spatial matched filter.
What do you observe?

(b) Repeat part (a) for an M = 60 element ULA. What do you observe?

(c) Now using the M = 30 element ULA again, but with interelement spacing $d = \lambda$, compute
the steered response and comment on the result.

(d) Compute the beampatterns for the spatial matched filter steered to $\phi = 0°$ for the three
array configurations in parts (a), (b), and (c).


11.5 In this problem, we want to investigate the use of randomly thinned arrays. Note that the M = 30
element ULA with $d = \lambda$ spacing from Problem 11.4 is simply the M = 60 element ULA
with every other element deleted. Such an array is often referred to as a thinned array. Using
an M = 60 element array, randomly thin the array. (Hint: Use a random number generator.)
First thin to 75 percent (45 elements) and then to 50 percent (30 elements). Compute and plot
the steered response, using a spatial matched filter for the signal in Problem 11.4. Note that
the spatial matched filter must take into account the positions of the elements; that is, it is
no longer a Vandermonde steering vector. Compute the beampatterns of these two randomly
thinned arrays. Repeat this process 3 times. What do you observe?


11.6 The spatial matched filter from (11.2.16) is the beamformer that maximizes the SNR in the
absence of interference. For this spatial matched filter, the beamforming or array gain was
shown to be $G_{bf} = M$. Suppose now that we have an M = 20 element ULA in which the
elements have unequal gain. In other words, the spatial matched filter no longer has the same
amplitude in each element. Find the spatial matched filter for the case when all even-numbered
elements have a unity gain, while all the odd-numbered elements have a gain of 2. What is the
beamforming gain for this array? The noise has equal power in all elements.


11.7 The optimum beamformer weights with MVDR normalization are found by solving the
following optimization


$$\min\; P_{i+n} \quad \text{subject to} \quad \mathbf{c}^H\mathbf{v}(\phi_s) = 1$$

Using Lagrange multipliers discussed in Appendix B, show that the MVDR optimum
beamformer weight vector is

$$\mathbf{c}_o = \frac{\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi_s)}{\mathbf{v}^H(\phi_s)\,\mathbf{R}_{i+n}^{-1}\,\mathbf{v}(\phi_s)}$$


11.8 In this problem, we want to investigate the different normalizations of the optimum beamformer
from Table 11.1. We refer to the three normalizations as MVDR ($\alpha = [\mathbf{v}^H(\phi_s)\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)]^{-1}$),


adaptive matched filter or AMF ($\alpha = [\mathbf{v}^H(\phi_s)\mathbf{R}_{i+n}^{-1}\mathbf{v}(\phi_s)]^{-1/2}$), and unit gain on noise
($\alpha = [\mathbf{v}^H(\phi_s)\mathbf{R}_{i+n}^{-2}\mathbf{v}(\phi_s)]^{-1/2}$). Let the interference-plus-noise signal consist of two interference
sources at $\phi = 45°$ and $\phi = 20°$ with powers of 30 and 15 dB, respectively. The noise power
is $\sigma_w^2 = 1$. Now compute the steered response of the optimum beamformers with the three
different normalizations between $-90° < \phi < 90°$ (using 1° increments), using the true
correlation matrix. What is the difference in the outputs of the optimum beamformers with the
different normalizations? For what purposes are the different normalizations useful?


11.9 The generalized sidelobe canceler (GSC) was derived as an alternative implementation of the
MVDR optimum beamformer. Show that the overall end-to-end weights associated with the
GSC are equivalent to the MVDR optimum beamformer weight vector in (11.3.15).


11.10 In the formulation of the optimum beamformer, we used the interference-plus-noise correlation
matrix $\mathbf{R}_{i+n}$. However, in many applications it is not possible to have
interference-plus-noise-only data, and the signal of interest is always present. Thus, the beamformer must be
implemented using the correlation matrix


$$\mathbf{R}_x = \mathbf{R}_{i+n} + \sigma_s^2\,\mathbf{v}(\phi_s)\,\mathbf{v}^H(\phi_s)$$


Show that the use of this correlation matrix will have no effect on the optimum beamformer
weight vector for the case of no signal mismatch. Hint: Use the matrix inversion lemma (see
Appendix A).


11.11 In this problem, we want to look at the effect of signal mismatch on the performance of
the optimum beamformer. Of course, the resulting beamformer is no longer really optimum,
but instead is optimized to our presumptions about the signal. Consider the case with three
interference sources at $\phi = 5°$, $20°$, and $-30°$ with powers of 25, 35, and 50 dB, respectively.
Compute the optimum beamformer steered to $\phi = 0°$. Now consider the case where the signal
of interest is not at $\phi = 0°$ but rather at $\phi = -1°$. The array consists of an M = 50 element
ULA with a noise power of $\sigma_w^2 = 1$.


(a) Find the signal mismatch loss when the signal of interest is not in the correlation matrix.
Vary the strength of the signal from 0 to 30 dB.

(b) Find the signal mismatch loss when the signal is in the correlation matrix. Vary the strength
of the signal from 0 to 30 dB.


11.12 Let us again consider a set of three interference sources at $\phi = 5°$, $20°$, and $-30°$ with powers
of 25, 35, and 50 dB, respectively. Now consider the case where the signal of interest is not
at $\phi = 0°$ but rather at $\phi = -1°$ and has a power of $M\sigma_s^2 = 20$ dB. The array consists of
an M = 50 element ULA with a noise power of $\sigma_w^2 = 1$. However, instead of computing the
optimum beamformer with the correlation matrix $\mathbf{R}_{i+n}$, use the diagonally loaded
interference-plus-noise matrix


$$\mathbf{R}_l = \mathbf{R}_{i+n} + \sigma_l^2\,\mathbf{I}$$


where $\sigma_l^2$ is the loading level.


(a) Find the signal mismatch loss when the signal of interest is not in the correlation matrix.
Compute and plot the mismatch loss varying the diagonal loading level from 0 to 20 dB.

(b) Find the signal mismatch loss when the signal is in the correlation matrix. Compute and
plot the mismatch loss varying the diagonal loading level from 0 to 20 dB.


11.13 The Frost sample-by-sample adaptive beamformer was derived for the MVDR beamformer.
Extend the Frost sample-by-sample adaptive beamformer for the case of multiple constraints
in an LCMV adaptive beamformer.


11.14 The LCMV beamformer weight vector is given in (11.6.7) and was found by using Lagrange
multipliers, which are discussed in Appendix B. Verify this result; that is, using Lagrange
multipliers, show that the LCMV beamformer weight vector is given by


$$\mathbf{c}_{lcmv} = \mathbf{R}_{i+n}^{-1}\,\mathbf{C}\,(\mathbf{C}^H\mathbf{R}_{i+n}^{-1}\mathbf{C})^{-1}\,\boldsymbol{\delta}$$

where $\mathbf{C}$ and $\boldsymbol{\delta}$ are defined as in Section 11.6.1.


11.15 Let us consider the sidelobe canceler, as discussed in Section 11.6.3. We restrict the problem
to a single interferer that has an angle $\phi_i$ with respect to a ULA that makes up the auxiliary


channels. The main channel consists of the signal

$$x_{mc}(n) = g_s s(n) + i_{mc}(n) + w_{mc}(n)$$

where $i_{mc}(n) = g_i i(n)$ is the temporally uncorrelated signal $i(n)$ with unit variance that has a
main channel gain of $g_i$. The main channel thermal noise $w_{mc}(n)$ is temporally uncorrelated
with a power of $\sigma_0^2$. The auxiliary channels make up an $M$-element ULA with thermal noise
power $\sigma_w^2$. The auxiliary channel signal vector is given by

$$\mathbf{x}(n) = s(n)\mathbf{v}(\phi_s) + \sigma_i\, i(n)\mathbf{v}(\phi_i) + \mathbf{w}(n)$$

where $\phi_s$ and $\phi_i$ are the angles of the signal of interest and the interferer with respect to the
ULA, respectively.

(a) Form the expressions for the auxiliary channel correlation matrix $\mathbf{R}_a$ and cross-correlation
vector $\mathbf{r}_{ma}$ that include the signal of interest in the auxiliary channels.

(b) Compute the output power of the interference-plus-noise.

(c) Compute the output power of the signal. What conclusions can you draw from your answer?


11.16 The MVDR optimum beamformer is simply a special case of the LCMV beamformer. In this
case, the constraint matrix is $\mathbf{C} = \mathbf{v}(\phi_s)$ and the constraint response vector is $\boldsymbol{\delta} = 1$.

(a) Using the LCMV weight vector given in (11.6.7), substitute this constraint and constraint
response and verify that the resulting beamformer weight vector is equal to the MVDR
optimum beamformer.

(b) Find an expression for the interference-plus-noise output power of the LCMV beamformer.


11.17 The optimum beamformer could also be formulated as the constrained optimization problem
that resulted in the MVDR beamformer. This beamformer can be implemented as a generalized
sidelobe canceler (GSC), as shown in Section 11.3.5. Similarly, the LCMV beamformer can
be implemented in a GSC architecture. Derive the formulation of a GSC with multiple linear
constraints.


11.18 Consider the case of an $M = 20$ element array with $d = \lambda/2$ interelement spacing and thermal
noise power $\sigma_w^2 = 1$. An interference source is present at $\phi = 30^\circ$ with a power of 50 dB.
Generate one realization of 1000 samples of this interferer. In addition, a signal of interest is
present at $\phi_s = 0^\circ$ with a power of $\sigma_s^2 = 100$ (20 dB) in the $n = 100$th sample only.

(a) Using an SMI adaptive beamformer for the full array, compute the output signal. Is the
signal of interest visible?

(b) Using a partially adaptive beamformer with $Q = 4$ nonoverlapping subarrays with $M = 5$
elements, compute the output of an SMI adaptive beamformer. What can you say about
the signal of interest now?

(c) Repeat part (b) with $Q = 2$ and $M = 10$. What are your observations now?


11.19 Consider the case of an $M = 40$ element array with $d = \lambda/2$ interelement spacing and thermal
noise power $\sigma_w^2 = 1$. An interference source is present at $\phi = 20^\circ$ with a power of 50 dB.
Generate one realization of 1000 samples of this interferer. In addition, a signal of interest is
present at $\phi_s = 0^\circ$ with a power of $\sigma_s^2 = 100$ (20 dB) in the $n = 100$th sample only.

(a) Using an SMI adaptive beamformer for the full array, compute the output signal. Is the
signal of interest visible?

(b) Using a beamspace partially adaptive beamformer consisting of 11 beams at the angles
$-5^\circ \le \phi \le 5^\circ$ at $1^\circ$ increments, compute the output of a partially adaptive SMI
beamformer. What can you say about the signal of interest now?

(c) Repeat part (b) with beams only at $\phi = -1^\circ$, $0^\circ$, and $1^\circ$. What are your observations now?


11.20 Compute the SINR loss for a partially adaptive beamformer with a general preprocessing
transformation $\mathbf{T}$. You need to start with the general definition of SINR loss

$$L_{\text{sinr}} \triangleq \frac{\text{SINR}_{\text{out}}(\phi_s)}{\text{SNR}_0}$$


11.21 Consider the case of an interference source at $\phi = 30^\circ$ with a power of 40 dB. The ULA is
a 20-element array with $d = \lambda/2$ interelement spacing and has unit-variance thermal noise
($\sigma_w^2 = 1$).

(a) Compute the SINR loss for the optimum beamformer for $-90^\circ < \phi < 90^\circ$.

(b) Let us consider the case of the GSC formulation of the optimum beamformer. If we choose
to use a beamspace blocking matrix $\mathbf{B}$ in (11.3.41), what are the spatial frequencies of
the spatial matched filters in this blocking matrix for an optimum beamformer steered to
$\phi = 0^\circ$?

(c) To implement a reduced-rank or partially adaptive beamformer, use only the two spatial
matched filters in the beamspace blocking matrix with spatial frequencies closest to the
interference source (spatial frequency $u = \tfrac{1}{2}\sin\phi_i = 0.25$). Compute the SINR loss
of this partially adaptive beamformer, and compare it to the SINR loss of the optimum
beamformer found in part (a).


CHAPTER 12 


Further Topics 


The distinguishing feature of this book, up to this point, is the reliance on random process 
models having finite variance and short memory and specified by their second-order mo- 
ments. This chapter deviates from this path by focusing on further topics where there is an 
explicit or implicit need for higher-order moments, long memory, or high variability. 

In the first part (Section 12.1), we introduce the area of higher-order statistics (HOS) 
with emphasis on the concepts of cumulants and polyspectra. We define cumulants and 
polyspectra; we analyze the effect of linear, time-invariant systems upon the HOS of the 
input process; and we derive the HOS of linear processes. Higher-order moments, unlike 
second-order moments, are shown to contain phase information and can be used to solve 
problems in which phase is important. 

In the second part (Sections 12.2 through 12.4), we illustrate the importance of HOS for 
the blind deconvolution of non-minimum-phase systems, and we show how the underlying 
theory can be used to design unsupervised adaptive filters for symbol-spaced and fractionally 
spaced equalization of data communication channels. 

In the third part (Sections 12.5 and 12.6), we introduce two types of random signal mod- 
els characterized by long memory: fractional and self-similar, or random, fractal models. 
We conclude with rational and fractional models with symmetric $\alpha$-stable (S$\alpha$S) excitations
and self-similar processes with S$\alpha$S increments. These models have long memory and find
many applications in the analysis and modeling of signals with long-range dependence and 
impulsive or spiky behavior. 


12.1 HIGHER-ORDER STATISTICS IN SIGNAL PROCESSING 


The statistics of a Gaussian process are completely specified by its second-order moments, 
that is, correlations and power spectral densities (see Section 3.3). Since non-Gaussian 
processes do not have this property, their higher-order statistics contain additional informa- 
tion that can be used to measure their deviation from normality. In this section we provide 
some background definitions and properties of higher-order moments, and we discuss their 
transformation by linear, time-invariant systems. More detailed treatments can be found in 
Mendel (1991), Nikias and Raghuveer (1987), Nikias and Mendel (1993), and Rosenblatt 
(1985). 


12.1.1 Moments, Cumulants, and Polyspectra 


The first four moments of a complex-valued stationary stochastic process are defined by

$$r_x^{(1)} \triangleq E\{x(n)\} = \mu_x \tag{12.1.1}$$

$$r_x^{(2)}(l_1) \triangleq E\{x^*(n)x(n+l_1)\} = r_x(l_1) \tag{12.1.2}$$

$$r_x^{(3)}(l_1,l_2) \triangleq E\{x^*(n)x(n+l_1)x(n+l_2)\} \tag{12.1.3}$$

$$r_x^{(4)}(l_1,l_2,l_3) \triangleq E\{x^*(n)x^*(n+l_1)x(n+l_2)x(n+l_3)\} \tag{12.1.4}$$


although other definitions are possible by conjugating different terms. We note that the first 
two moments are the mean and the autocorrelation sequence, respectively. 

In Section 3.2.4 we showed that the cumulant of a linear combination of IID random 
variables can be determined by a linear combination of their cumulants. In addition, in 
Section 3.1.2, we noted that the kurtosis of a random variable measures its deviation from 
Gaussian behavior. For these reasons, we usually prefer to work with cumulants instead of 
moments. Since higher-order cumulants are invariant to a shift of the mean value, we define 
them under a zero-mean assumption. 

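The first four cumulants of a zero-mean stationary process are defined by

$$\kappa_x^{(1)} = E\{x(n)\} = \mu_x = 0 \tag{12.1.5}$$

$$\kappa_x^{(2)}(l_1) = E\{x^*(n)x(n+l_1)\} = r_x(l_1) \tag{12.1.6}$$

$$\kappa_x^{(3)}(l_1,l_2) = E\{x^*(n)x(n+l_1)x(n+l_2)\} \tag{12.1.7}$$

$$\begin{aligned}
\kappa_x^{(4)}(l_1,l_2,l_3) = {} & E\{x^*(n)x^*(n+l_1)x(n+l_2)x(n+l_3)\} \\
& - \kappa_x^{(2)}(l_2)\kappa_x^{(2)}(l_3-l_1) - \kappa_x^{(2)}(l_3)\kappa_x^{(2)}(l_2-l_1)
\end{aligned} \tag{12.1.8}$$

(complex-valued case)

$$\begin{aligned}
\kappa_x^{(4)}(l_1,l_2,l_3) = {} & E\{x(n)x(n+l_1)x(n+l_2)x(n+l_3)\} - \kappa_x^{(2)}(l_1)\kappa_x^{(2)}(l_3-l_2) \\
& - \kappa_x^{(2)}(l_2)\kappa_x^{(2)}(l_3-l_1) - \kappa_x^{(2)}(l_3)\kappa_x^{(2)}(l_2-l_1)
\end{aligned} \tag{12.1.9}$$

(real-valued case)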


and can be obtained by using the cumulant-generating function discussed in Section 3.1.2
(Mendel 1991). It can be shown that

$$\kappa_x^{(k)}(l_1,l_2,\ldots,l_{k-1}) = r_x^{(k)}(l_1,l_2,\ldots,l_{k-1}) - r_g^{(k)}(l_1,l_2,\ldots,l_{k-1}) \qquad k = 3, 4 \tag{12.1.10}$$

where $x(n)$ is a non-Gaussian process and $g(n)$ is a Gaussian process with the same mean and
autocorrelation sequence. The negative terms in (12.1.8) and (12.1.9) express the fourth-order
moment of the Gaussian process in terms of second-order ones. In this sense, in
addition to higher-order correlations, cumulants measure the distance of a process from
Gaussianity. Note that if $x(n)$ is Gaussian, $\kappa_x^{(k)}(l_1,l_2,\ldots,l_{k-1}) = 0$ for all $k \ge 3$ even if
Equation (12.1.10) holds only for $k = 3, 4$.

If we assume that $\mu_x = 0$ and set $l_1 = l_2 = l_3 = 0$ in (12.1.6) through (12.1.9), we
obtain

$$\kappa_x^{(2)}(0) = E\{|x(n)|^2\} = \sigma_x^2 \tag{12.1.11}$$

$$\kappa_x^{(3)}(0,0) = E\{x^*(n)x^2(n)\} = \gamma_x^{(3)} \tag{12.1.12}$$

$$\kappa_x^{(4)}(0,0,0) = E\{|x(n)|^4\} - 2\sigma_x^4 \qquad \text{complex} \tag{12.1.13}$$

$$\kappa_x^{(4)}(0,0,0) = E\{x^4(n)\} - 3\sigma_x^4 \qquad \text{real} \tag{12.1.14}$$

which provide the variance, unnormalized skewness, and unnormalized kurtosis of the
process (see Section 3.1.2).
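As a small illustration of the zero-lag quantities in (12.1.11), (12.1.12), and (12.1.14), the following sketch evaluates them exactly for a real random variable described by a probability mass function. The function name `zero_lag_cumulants` and the two example sources are assumptions made only for this illustration; they do not come from the text.

```python
from fractions import Fraction

def zero_lag_cumulants(values, probs):
    """Return (variance, third cumulant, fourth cumulant) at zero lag for a
    real discrete random variable, following (12.1.11), (12.1.12), (12.1.14)."""
    mean = sum(p * v for v, p in zip(values, probs))
    centered = [v - mean for v in values]   # enforce the zero-mean assumption
    m2 = sum(p * c ** 2 for c, p in zip(centered, probs))
    m3 = sum(p * c ** 3 for c, p in zip(centered, probs))
    m4 = sum(p * c ** 4 for c, p in zip(centered, probs))
    return m2, m3, m4 - 3 * m2 ** 2

half = Fraction(1, 2)
# Symmetric binary source taking the values -1 and +1 with equal probability.
binary = zero_lag_cumulants([-1, 1], [half, half])
# Asymmetric zero-mean source: 2 with probability 1/3, -1 with probability 2/3.
skewed = zero_lag_cumulants([2, -1], [Fraction(1, 3), Fraction(2, 3)])

print("binary:", binary)
print("skewed:", skewed)
```

The symmetric source has a vanishing third-order cumulant, so only its fourth-order cumulant reveals the deviation from Gaussianity, whereas the asymmetric source has nonzero cumulants of both orders.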

If the probability distribution of a process is symmetric (e.g., uniform, Gaussian, 
Laplace), its third-order cumulants are zero. In such cases, we need to consider fourth- 
order cumulants. Higher-order cumulants (k > 4) are seldom used in practice. 

If the cumulants are absolutely summable, we can define the kth-order cumulant spectra,
higher-order spectra, or polyspectra as the $(k-1)$-dimensional Fourier transform of
the kth-order cumulant. More specifically, the power spectral density (PSD), bispectrum,
and trispectrum of a zero-mean stationary process are defined by

$$R_x^{(2)}(e^{j\omega}) \triangleq \sum_{l_1=-\infty}^{\infty} \kappa_x^{(2)}(l_1)\,e^{-j\omega l_1} = R_x(e^{j\omega}) \tag{12.1.15}$$

$$R_x^{(3)}(e^{j\omega_1}, e^{j\omega_2}) \triangleq \sum_{l_1=-\infty}^{\infty}\sum_{l_2=-\infty}^{\infty} \kappa_x^{(3)}(l_1,l_2)\,e^{-j(\omega_1 l_1+\omega_2 l_2)} \qquad \text{(bispectrum)} \tag{12.1.16}$$

and

$$R_x^{(4)}(e^{j\omega_1}, e^{j\omega_2}, e^{j\omega_3}) \triangleq \sum_{l_1=-\infty}^{\infty}\sum_{l_2=-\infty}^{\infty}\sum_{l_3=-\infty}^{\infty} \kappa_x^{(4)}(l_1,l_2,l_3)\,e^{-j(\omega_1 l_1+\omega_2 l_2+\omega_3 l_3)} \qquad \text{(trispectrum)} \tag{12.1.17}$$

where $\omega_1$, $\omega_2$, and $\omega_3$ are the frequency variables. In contrast to the PSD, which is
real-valued and nonnegative, both the bispectrum and the trispectrum are complex-valued. Since
the higher-order cumulants of a Gaussian process are zero, its bispectrum and trispectrum
are zero as well.

Many symmetries exist in the arguments of cumulants and polyspectra of both real and 
complex stochastic processes (Rosenblatt 1985; Nikias and Mendel 1993). For example, 
from the obvious symmetry 


$$r_x^{(3)}(l_1,l_2) = r_x^{(3)}(l_2,l_1) \tag{12.1.18}$$

we obtain

$$R_x^{(3)}(e^{j\omega_1}, e^{j\omega_2}) = R_x^{(3)}(e^{j\omega_2}, e^{j\omega_1}) \tag{12.1.19}$$

which is a basic property of the bispectrum.

For real-valued processes, we have the additional symmetries

$$\begin{aligned}
r_x^{(3)}(l_1,l_2) &= r_x^{(3)}(-l_2,\, l_1-l_2) = r_x^{(3)}(-l_1,\, l_2-l_1) \\
&= r_x^{(3)}(l_2-l_1,\, -l_1) = r_x^{(3)}(l_1-l_2,\, -l_2)
\end{aligned} \tag{12.1.20}$$

$$\begin{aligned}
r_x^{(4)}(l_1,l_2,l_3) &= r_x^{(4)}(l_2,l_1,l_3) = r_x^{(4)}(l_1,l_3,l_2) \\
&= r_x^{(4)}(-l_1,\, l_2-l_1,\, l_3-l_1)
\end{aligned}$$

which can be used to simplify the computation of cumulants. It can be shown that the
nonredundant region for $r_x^{(3)}(l_1,l_2)$ is the wedge $\{(l_1,l_2) : 0 \le l_2 \le l_1 \le \infty\}$ and for
$r_x^{(4)}(l_1,l_2,l_3)$ is the cone $\{(l_1,l_2,l_3) : 0 \le l_3 \le l_2 \le l_1 \le \infty\}$. The symmetries of
cumulants impose symmetry properties upon the polyspectra. Indeed, by using (12.1.20) it can be shown that

$$R_x^{(3)}(e^{j\omega_1}, e^{j\omega_2}) = R_x^{(3)}(e^{j\omega_2}, e^{j\omega_1}) = R_x^{(3)}(e^{j\omega_1}, e^{-j\omega_1-j\omega_2}) \tag{12.1.21}$$

$$= R_x^{(3)}(e^{-j\omega_1-j\omega_2}, e^{j\omega_2}) = R_x^{(3)*}(e^{-j\omega_1}, e^{-j\omega_2}) \tag{12.1.22}$$

which implies that the nonredundant region for the bispectrum is the triangle with vertices at
$(0, 0)$, $(2\pi/3, 2\pi/3)$, and $(\pi, 0)$. The trispectrum of a real-valued process has 96 symmetry
regions (Pflug et al. 1992).

Finally, we note that if in (12.1.7) we replace $x(n+l_1)$ by $y(n+l_1)$ and $x(n+l_2)$ by
$z(n+l_2)$, we can define the cross-cumulant and then take its Fourier transform to find the
cross-bispectrum. These quantities are useful for joint signal analysis (Nikias and Mendel
1993).


12.1.2 Higher-Order Moments and LTI Systems 


Consider a BIBO stable linear, time-invariant (LTI) system with impulse response $h(n)$ and
input-output relation given by

$$y(n) = \sum_{k=-\infty}^{\infty} h(k)x(n-k) \tag{12.1.23}$$

If the input $x(n)$ is stationary with zero mean, the output autocorrelation is

$$r_y(l) = \sum_{k_0}\sum_{k_1} h(k_1)h^*(k_0)\,r_x(l-k_1+k_0) \tag{12.1.24}$$

where the range of summations, which is from $-\infty$ to $\infty$, is dropped for convenience (see
Section 3.4). Also we have

$$R_y(e^{j\omega}) = |H(e^{j\omega})|^2 R_x(e^{j\omega}) \tag{12.1.25}$$

which shows that the output PSD is insensitive to the phase response of the system.


Using (12.1.23) and (12.1.7), we can show that the input and output third-order cumulants
are related by

$$\kappa_y^{(3)}(l_1,l_2) = \sum_{k_0}\sum_{k_1}\sum_{k_2} h^*(k_0)h(k_1)h(k_2)\,\kappa_x^{(3)}(l_1-k_1+k_0,\; l_2-k_2+k_0) \tag{12.1.26}$$

To obtain the fourth-order cumulant of the output, we first determine its fourth-order moment

$$\begin{aligned}
r_y^{(4)}(l_1,l_2,l_3) = {} & \sum_{k_0}\sum_{k_1}\sum_{k_2}\sum_{k_3} h^*(k_0)h^*(k_1)h(k_2)h(k_3) \\
& \times r_x^{(4)}(l_1-k_1+k_0,\; l_2-k_2+k_0,\; l_3-k_3+k_0)
\end{aligned} \tag{12.1.27}$$

using (12.1.23) and (12.1.4) (see Problem 12.1). Then using (12.1.8), (12.1.9), and (12.1.24),
we have

$$\begin{aligned}
\kappa_y^{(4)}(l_1,l_2,l_3) = {} & \sum_{k_0}\sum_{k_1}\sum_{k_2}\sum_{k_3} h^*(k_0)h^*(k_1)h(k_2)h(k_3) \\
& \times \kappa_x^{(4)}(l_1-k_1+k_0,\; l_2-k_2+k_0,\; l_3-k_3+k_0)
\end{aligned} \tag{12.1.28}$$

which holds for both real- and complex-valued processes. An interesting interpretation of
this relationship in terms of convolutions is given in Mendel (1991) and in Therrien (1992).
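We now compute the bispectrum of the output signal $y(n)$ in terms of the bispectrum
of the input $x(n)$ and the frequency response $H(e^{j\omega})$ of the system. Indeed, taking the
two-dimensional Fourier transform of (12.1.26) and interchanging the order of summations, we
have

$$\begin{aligned}
R_y^{(3)}(e^{j\omega_1}, e^{j\omega_2}) &= \sum_{l_1}\sum_{l_2} \kappa_y^{(3)}(l_1,l_2)\,e^{-j(\omega_1 l_1+\omega_2 l_2)} \\
&= \sum_{l_1}\sum_{l_2}\sum_{k_0}\sum_{k_1}\sum_{k_2} h^*(k_0)h(k_1)h(k_2)\,\kappa_x^{(3)}(l_1-k_1+k_0,\; l_2-k_2+k_0)\,e^{-j(\omega_1 l_1+\omega_2 l_2)} \\
&= \sum_{k_0} h^*(k_0)e^{j(\omega_1+\omega_2)k_0} \sum_{k_1} h(k_1)e^{-j\omega_1 k_1} \sum_{k_2} h(k_2)e^{-j\omega_2 k_2} \\
&\quad \times \sum_{l_1}\sum_{l_2} \kappa_x^{(3)}(l_1-k_1+k_0,\; l_2-k_2+k_0)\,e^{-j\omega_1(l_1-k_1+k_0)}\,e^{-j\omega_2(l_2-k_2+k_0)}
\end{aligned}$$

Rearranging terms and using (12.1.16), we obtain

$$R_y^{(3)}(e^{j\omega_1}, e^{j\omega_2}) = H^*(e^{j(\omega_1+\omega_2)})\,H(e^{j\omega_1})\,H(e^{j\omega_2})\,R_x^{(3)}(e^{j\omega_1}, e^{j\omega_2}) \tag{12.1.29}$$

which shows that the bispectrum, in contrast to the PSD in (12.1.25), is sensitive to the
phase of the system.

In a similar, but more complicated way, we can show that the output trispectrum is
given by

$$\begin{aligned}
R_y^{(4)}(e^{j\omega_1}, e^{j\omega_2}, e^{j\omega_3}) = {} & H^*(e^{j(\omega_1+\omega_2+\omega_3)})\,H^*(e^{-j\omega_1})\,H(e^{j\omega_2}) \\
& \times H(e^{j\omega_3})\,R_x^{(4)}(e^{j\omega_1}, e^{j\omega_2}, e^{j\omega_3})
\end{aligned} \tag{12.1.30}$$

which again shows that the trispectrum is sensitive to the phase of the system.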


12.1.3 Higher-Order Moments of Linear Signal Models 


In Section 4.1 we discussed linear signal models and the innovations representation of
stationary processes with given second-order moments. This representation is given by

$$x(n) = \sum_{k=0}^{\infty} h(k)w(n-k) \tag{12.1.31}$$

$$r_x(l) = \sigma_w^2 \sum_{k=0}^{\infty} h(k)h^*(k-l) \tag{12.1.32}$$

$$R_x(e^{j\omega}) = \sigma_w^2\,|H(e^{j\omega})|^2 \tag{12.1.33}$$

where $w(n)$ is a white noise process and $H(z)$ is minimum-phase. If $w(n)$ is Gaussian,
$x(n)$ is Gaussian and this representation provides a complete description of the process.
If the excitation $w(n)$ is IID and non-Gaussian with cumulants

$$\kappa_w^{(k)}(l_1,l_2,\ldots,l_{k-1}) = \begin{cases} \gamma_w^{(k)} & l_1 = l_2 = \cdots = l_{k-1} = 0 \\ 0 & \text{otherwise} \end{cases} \tag{12.1.34}$$

the output of the linear model is also non-Gaussian. The cumulants and the polyspectra of
process $x(n)$ are

$$\kappa_x^{(k)}(l_1,l_2,\ldots,l_{k-1}) = \gamma_w^{(k)} \sum_{n=0}^{\infty} h(n)h(n+l_1)\cdots h(n+l_{k-1}) \tag{12.1.35}$$

and

$$R_x^{(k)}(e^{j\omega_1}, e^{j\omega_2}, \ldots, e^{j\omega_{k-1}}) = \gamma_w^{(k)}\,H(e^{j\omega_1})\,H(e^{j\omega_2})\cdots H(e^{j\omega_{k-1}})\,H^*\!\left(e^{j\sum_{i=1}^{k-1}\omega_i}\right) \tag{12.1.36}$$

respectively. The cases for $k = 3, 4$ follow easily from Equations (12.1.26), (12.1.28) to
(12.1.30), and (12.1.34). A general derivation is discussed in Problem 12.2.

Setting $k = 3$ into (12.1.36), we obtain

$$\angle R_x^{(3)}(e^{j\omega_1}, e^{j\omega_2}) = \angle\gamma_w^{(3)} - \angle H(e^{j(\omega_1+\omega_2)}) + \angle H(e^{j\omega_1}) + \angle H(e^{j\omega_2}) \tag{12.1.37}$$

which shows that we can use the bispectrum of the output to determine the phase response
of the system if the input is a non-Gaussian IID process. From (12.1.33) we see that this is
not possible using the PSD.
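The relation between the cumulant sum (12.1.35), the bispectrum definition (12.1.16), and the product form (12.1.36) can be checked numerically for a short real FIR model. The sketch below is illustrative only: the helper names, the particular impulse response, the test frequencies, and the choice of a unit third-order input cumulant (`gamma3 = 1.0`) are all assumptions, not values taken from the text.

```python
import cmath

def freq_response(h, w):
    """H(e^{jw}) of a causal FIR impulse response h(0), ..., h(q)."""
    return sum(hn * cmath.exp(-1j * w * n) for n, hn in enumerate(h))

def third_cumulant(h, l1, l2, gamma3=1.0):
    """Third-order cumulant of a real linear process, (12.1.35) with k = 3."""
    q = len(h)
    total = 0.0
    for n in range(q):
        if 0 <= n + l1 < q and 0 <= n + l2 < q:
            total += h[n] * h[n + l1] * h[n + l2]
    return gamma3 * total

def bispectrum_from_cumulants(h, w1, w2, gamma3=1.0):
    """Direct two-dimensional Fourier sum (12.1.16) over the finite support."""
    q = len(h) - 1
    return sum(third_cumulant(h, l1, l2, gamma3)
               * cmath.exp(-1j * (w1 * l1 + w2 * l2))
               for l1 in range(-q, q + 1) for l2 in range(-q, q + 1))

def bispectrum_from_system(h, w1, w2, gamma3=1.0):
    """Product form (12.1.36) with k = 3."""
    return (gamma3 * freq_response(h, w1) * freq_response(h, w2)
            * freq_response(h, w1 + w2).conjugate())

h = [1.0, -1.4, 0.45]      # (1 - 0.5 z^-1)(1 - 0.9 z^-1), an assumed example
w1, w2 = 0.7, 1.1          # arbitrary test frequencies in radians
direct = bispectrum_from_cumulants(h, w1, w2)
product = bispectrum_from_system(h, w1, w2)

# Phase relation (12.1.37) for a positive gamma3, compared on the unit circle
# so that 2*pi wraparound does not matter.
phase_sum = (cmath.phase(freq_response(h, w1)) + cmath.phase(freq_response(h, w2))
             - cmath.phase(freq_response(h, w1 + w2)))
phase_error = cmath.exp(1j * cmath.phase(direct)) - cmath.exp(1j * phase_sum)

print(abs(direct - product), abs(phase_error))
```

Both printed quantities should be at the level of floating-point rounding, which illustrates that the bispectrum phase is fixed by the phase response of the system.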


EXAMPLE 12.1.1. For $0 < a < 1$ and $0 < b < 1$, consider the MA(2) systems

$$H_{\min}(z) = (1 - az^{-1})(1 - bz^{-1})$$

$$H_{\max}(z) = (1 - az)(1 - bz)$$

$$H_{\text{mix}}(z) = (1 - az)(1 - bz^{-1})$$

which obviously are minimum-phase, maximum-phase, and mixed-phase, respectively. All these
systems have the same output complex PSD

$$R_x(z) = \sigma_w^2 H_{\min}(z)H_{\min}(z^{-1}) = \sigma_w^2 H_{\max}(z)H_{\max}(z^{-1}) = \sigma_w^2 H_{\text{mix}}(z)H_{\text{mix}}(z^{-1})$$

and hence the same autocorrelation. As a result, we cannot correctly identify the phase response
of an MA(2) model using the PSD (or equivalently the autocorrelation) of the output signal.
However, we can correctly identify the phase by using the bispectrum. The output third-order
cumulant $\kappa_x^{(3)}(l_1,l_2)$ for the above MA(2) models can be computed by using either the complex
bispectrum (the z transform of the third-order cumulant)

$$R_x^{(3)}(z_1, z_2) = \gamma_w^{(3)} H(z_1)H(z_2)H(z_1^{-1}z_2^{-1})$$

or the formula

$$\kappa_x^{(3)}(l_1,l_2) = \gamma_w^{(3)} \sum_{n=0}^{2} h(n)h(n+l_1)h(n+l_2) \tag{12.1.38}$$

for all values of $l_1$, $l_2$ that lead to overlapping terms in the summation. The results are summarized
in Table 12.1. The values shown are for the principal region (see Figure 12.1); the remaining
ones are computed using the symmetry relations (12.1.20). Using the formula

$$R_x^{(3)}(e^{j\omega_1}, e^{j\omega_2}) = \gamma_w^{(3)} H(e^{j\omega_1})H(e^{j\omega_2})H^*(e^{j(\omega_1+\omega_2)}) \tag{12.1.39}$$

we can numerically compute the bispectrum, using the DFT (see Problem 12.4). The results
are plotted in Figure 12.2. We see that the cumulants and bispectra of the three systems are
different. Hence, the third-order moments can be used to identify both the magnitude and the
phase response of the MA(2) model.
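The claim of the example can be reproduced with a few lines of code. The sketch below evaluates the autocorrelation and the sum in (12.1.38) for $a = 0.5$ and $b = 0.9$ (the values used in Figure 12.2), assuming $\sigma_w^2 = \gamma_w^{(3)} = 1$. As a further assumption made here for convenience, the maximum- and mixed-phase impulse responses are delayed so that all three are causal; a pure delay changes neither the autocorrelation nor the cumulants. The helper names are hypothetical.

```python
def autocorr(h, lag):
    """r_x(lag) for unit-variance white input and a real FIR h, lag >= 0."""
    return sum(h[n] * h[n + lag] for n in range(len(h) - lag))

def cum3(h, l1, l2):
    """Sum in (12.1.38) for lags l1, l2 >= 0 and gamma_w^(3) = 1."""
    q = len(h)
    return sum(h[n] * h[n + l1] * h[n + l2]
               for n in range(q) if n + l1 < q and n + l2 < q)

a, b = 0.5, 0.9
systems = {
    "min": [1.0, -(a + b), a * b],   # (1 - a z^-1)(1 - b z^-1)
    "max": [a * b, -(a + b), 1.0],   # (1 - a z)(1 - b z), delayed by 2 samples
    "mix": [-a, 1.0 + a * b, -b],    # (1 - a z)(1 - b z^-1), delayed by 1 sample
}

acf = {name: [autocorr(h, lag) for lag in range(3)] for name, h in systems.items()}
k11 = {name: cum3(h, 1, 1) for name, h in systems.items()}

for name in systems:
    print(name, "autocorrelation:", acf[name], "cumulant (1,1):", k11[name])
```

The three autocorrelation sequences coincide, while the third-order cumulant at lags (1, 1) takes a different value for each system.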


TABLE 12.1

Minimum-, maximum-, and mixed-phase MA(2) systems with the same autocorrelation
(or PSD) but with different third-order moments ($0 < a < 1$, $0 < b < 1$) (Nikias and
Raghuveer 1987).


Cumulants                  Minimum-phase MA(2)           Maximum-phase MA(2)           Mixed-phase MA(2)

$\kappa_x^{(3)}(0,0)$      $1 - (a+b)^3 + a^3b^3$        $1 - (a+b)^3 + a^3b^3$        $(1+ab)^3 - a^3 - b^3$
$\kappa_x^{(3)}(1,1)$      $(a+b)^2 - (a+b)a^2b^2$       $-(a+b) + ab(a+b)^2$          $-a(1+ab)^2 + (1+ab)b^2$
$\kappa_x^{(3)}(2,2)$      $a^2b^2$                      $ab$                          $-ab^2$
$\kappa_x^{(3)}(1,0)$      $-(a+b) + ab(a+b)^2$          $(a+b)^2 - (a+b)a^2b^2$       $a^2(1+ab) - (1+ab)^2b$
$\kappa_x^{(3)}(2,0)$      $ab$                          $a^2b^2$                      $-a^2b$
$\kappa_x^{(3)}(2,1)$      $-(a+b)ab$                    $-(a+b)ab$                    $ab(1+ab)$
$r_x(0)$                   $1 + a^2b^2 + (a+b)^2$        $1 + a^2b^2 + (a+b)^2$        $1 + a^2b^2 + (a+b)^2$
$r_x(1)$                   $-(a+b)(1+ab)$                $-(a+b)(1+ab)$                $-(a+b)(1+ab)$
$r_x(2)$                   $ab$                          $ab$                          $ab$

FIGURE 12.1
Region of support for the third-order cumulant of the MA(2) model. The solid circles indicate
the primary samples, which can be utilized to determine the remaining samples using symmetry
relations.


FIGURE 12.2
Bispectrum magnitude and phase for minimum-, maximum-, and mixed-phase
MA(2) models with $a = 0.5$ and $b = 0.9$.


From the previous example and the general discussion of higher-order moments and
their transformation by linear systems, we conclude that HOS can be useful when we deal
with non-Gaussian signals or Gaussian signals that have passed through nonlinear
systems. More specifically, the use of HOS is beneficial in the following cases: suppression
of additive Gaussian colored noise, identification of the phase response of a system using
output data only, and characterization of non-Gaussian processes or nonlinear systems.
More details and applications are discussed in Nikias and Mendel (1993). However, note
that the application of HOS-based methods to real-world problems is very difficult because
(1) the computation of reliable estimates of higher-order moments requires a large amount
of data and (2) the assessment and interpretation of the results require a solid statistical
background and extensive practical experience.

12.2 BLIND DECONVOLUTION 


In Section 6.7, we discussed optimum inverse filtering and deconvolution using the minimum
mean square error (MMSE) criterion under the assumption that all required statistical
moments are known. In the case of blind deconvolution (see Figure 6.23), the goal is to
retrieve the input of a system $G(z)$ by using only the output signal and possibly some
statistical information about the input. The most critical requirement is that the input signal
$w(n)$ be IID, which is a reasonable assumption for many applications of practical interest.
In this case, we have

$$R_x(e^{j\omega}) = \sigma_w^2\,|G(e^{j\omega})|^2 \tag{12.2.1}$$

which can be used to determine, at least in principle, the magnitude response $|G(e^{j\omega})|$
from the output PSD $R_x(e^{j\omega})$. In general, it is impossible to obtain the phase response of
the system from $R_x(e^{j\omega})$ without additional information. For example, if we know that
$G(z) = 1/A(z)$ is a minimum-phase AP(P) system, we can uniquely identify it from
$r_x(l)$ or $R_x(e^{j\omega})$, using the method of linear prediction. However, if the system is not
minimum-phase, the method of linear prediction will identify it as minimum-phase, leading
to erroneous results.

The importance of the input probability density function in deconvolution applications
is illustrated in the following figures. Figure 12.3 shows a random sequence generated by
filtering white Gaussian noise with a minimum-phase system $H(z)$ and the sequences
obtained by deconvolution of this sequence with the minimum-phase, maximum-phase, and
mixed-phase inverse systems corresponding to $H(z)$. The three deconvolved sequences,
which look visually similar, are all uncorrelated and statistically indistinguishable, because
in the Gaussian case uncorrelatedness implies statistical independence. Figure 12.4 shows
the results of the same experiment repeated with the same systems and a white noise
sequence with an exponential probability density function. It is now clear that only the
minimum-phase inverse system provides the correct answer, although all three deconvolved
sequences have the same second-order statistics (Donoho 1981). More details about the
generation of these figures and further discussion of their meaning are given in Problem 12.5.


FIGURE 12.3
A minimum-phase Gaussian random sequence and its deconvolution by the corresponding minimum-,
maximum-, and mixed-phase inverse systems.


FIGURE 12.4
A minimum-phase non-Gaussian random sequence and its deconvolution by the corresponding minimum-,
maximum-, and mixed-phase inverse systems.


We conclude that complete identification of $G(z)$, and therefore correct retrieval of the
input signal $w(n)$, requires the identification of the phase response $\angle G(e^{j\omega})$ of the system;
failure to do so may lead to erroneous results.

If the input $w(n)$ is IID and non-Gaussian, the bispectrum of the output is [see (12.1.36)]

$$R_x^{(3)}(e^{j\omega_1}, e^{j\omega_2}) = \kappa_w^{(3)}\,G(e^{j\omega_1})\,G(e^{j\omega_2})\,G^*(e^{j(\omega_1+\omega_2)}) \tag{12.2.2}$$

and it can be used to determine both the magnitude and phase response of $G(e^{j\omega})$ from the
magnitude and argument of the bispectrum. If the bispectrum is identically zero, we can
use some nonzero higher-order spectrum. Therefore, HOS can be used in many ways to
obtain unbiased estimates of $\angle G(e^{j\omega})$ provided that the polyspectra are not all identically
zero, or equivalently the input probability density function (pdf) is non-Gaussian (Matsuoka
and Ulrych 1984; Mendel 1991; Nikias and Mendel 1993). We emphasize that, in practice,
polyspectra estimators have high variance, and therefore reliable phase estimation requires
very long data records. In conclusion, blind deconvolution is always possible provided a
stable inverse $1/G(z)$ exists and $w(n)$ is non-Gaussian. If $w(n)$ is Gaussian, we cannot
correctly identify the phase response of the inverse system using only the second-order
moments of the output signal.

As we have already mentioned, MMSE linear prediction solves the blind deconvolution
problem for minimum-phase systems with Gaussian inputs using the autocorrelation of the
output signal. In essence, the inverse system retrieves the input by restoring its flat PSD,
which has been colored by the system $G(z)$. This suggests the following question: Is it
possible to uniquely determine the inverse system $h(n)$ by restoring some property of the
input signal (besides spectral flatness) that has been distorted by the system $G(z)$? To
address this question, let us consider the effects of an LTI system upon the probability
density function of the input signal. We recall that


• If the input pdf is Gaussian, then the output pdf is Gaussian. In general, if the input pdf
is stable, then the output pdf is also stable. This follows from the fact that only stable
random variables are invariant under linear transformations (see Section 3.2.4). However,
we limit our discussion to Gaussian signals because they have finite variance.

• If the input pdf is non-Gaussian, then the output pdf tends to Gaussian as a result of
the central limit theorem (see Section 3.3.7). The “Gaussianization” capability of the
system depends on the length and amplitude of its impulse response. This is illustrated
in Example 3.2.4, which shows that the sum of uniform random variables becomes “more
Gaussian” as their number increases.
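The second point can be made quantitative with a short calculation. The following worked example is a sketch under the assumption of IID variables $w_k$ uniformly distributed on $[-1/2, 1/2]$; this interval is chosen here only for convenience. It uses the additivity of cumulants of independent variables recalled in Section 12.1.1 to show that the normalized kurtosis of the sum of $n$ such variables decays as $1/n$.

```latex
\begin{align*}
\sigma_w^2 &= \tfrac{1}{12}, \qquad E\{w^4\} = \tfrac{1}{80}, \qquad
\kappa_w^{(4)} = E\{w^4\} - 3\sigma_w^4 = \tfrac{1}{80} - \tfrac{1}{48} = -\tfrac{1}{120} \\
x &= \sum_{k=1}^{n} w_k \quad\Longrightarrow\quad
\sigma_x^2 = n\,\sigma_w^2, \qquad \kappa_x^{(4)} = n\,\kappa_w^{(4)} \\
\frac{\kappa_x^{(4)}}{\sigma_x^4} &= \frac{1}{n}\,\frac{\kappa_w^{(4)}}{\sigma_w^4} = -\frac{1.2}{n}
\end{align*}
```

The normalized kurtosis of a single uniform variable is $-1.2$; for the sum it shrinks toward zero, the Gaussian value, as $n$ grows.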


We see that filtering of a non-Gaussian IID sequence increases its Gaussianity. The only
system that does not alter a non-Gaussian input pdf has impulse response with one nonzero
sample, that is, $b_0\delta(n - n_0)$. In any other case, the input and output distributions are
different, except if the input is Gaussian. A strict proof is provided by the following theorem
(Kagan et al. 1973).


THEOREM 12.1. Consider a random variable $x$ defined by the linear combination of IID random
variables $w_k$

$$x = \sum_k c_k w_k \tag{12.2.3}$$

with coefficients such that $\sum_k |c_k|^2 < \infty$. The random variable $x$ is Gaussian if and only if (a) $x$
has finite variance, (b) $x \overset{d}{=} w_k$ for all $k$, (c) at least two coefficients $c_k$ are not zero.


If we define the overall system (see Figure 12.5)

$$c(n) = g(n) * h(n) \tag{12.2.4}$$

the signals $y(n)$ and $w(n)$ can have the same non-Gaussian distribution if and only if $c(n)$
has only one nonzero coefficient. Hence, if we know the input pdf, we can determine the
inverse system $h(n)$ by restoring the pdf of $y(n)$ to match the pdf of the input $w(n)$. However,
it turns out that instead of restoring the pdf, that is, all moments (Benveniste et al. 1980),
we only need to restore the moments up to order 4 (Shalvi and Weinstein 1990). This is
shown in the following theorem.


In many practical applications (e.g., seismology), the underlying data are non-Gaussian; however, unavoidable 
filtering operations (e.g., recording instruments) tend to “Gaussianize” their distribution. As a result, many times 
the non-Gaussianity of the data becomes apparent after proper deconvolution (Donoho 1981). 


FIGURE 12.5
Basic blind deconvolution model (blocks: unknown input, unknown system, deconvolution filter).


THEOREM 12.2. Consider a stable LTI system

$$y(n) = \sum_k c(k)w(n-k) \tag{12.2.5}$$

with an IID input $w(n)$ that has finite moments up to order 4. Then we have

$$E\{|y(n)|^2\} = E\{|w(n)|^2\} \sum_k |c(k)|^2 \tag{12.2.6}$$

$$E\{y^2(n)\} = E\{w^2(n)\} \sum_k c^2(k) \tag{12.2.7}$$

and

$$\kappa_y^{(4)} = \kappa_w^{(4)} \sum_k |c(k)|^4 \tag{12.2.8}$$

where

$$\kappa_y^{(4)} = E\{|y(n)|^4\} - 2E^2\{|y(n)|^2\} - |E\{y^2(n)\}|^2 \tag{12.2.9}$$

is the fourth-order cumulant of $y(n)$ and $\kappa_w^{(4)}$ is the fourth-order cumulant of $w(n)$.
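Before turning to the proof, relation (12.2.8) can be checked exactly on a small case. The sketch below is illustrative only: the zero-mean two-valued input, the three-tap real system `c`, and the helper name `kurtosis_cumulant` are assumptions made for this example. It enumerates every combination of the input samples that contribute to one output sample, builds the exact distribution of $y(n)$, and compares the cumulant (12.2.9) of the output with the right-hand side of (12.2.8).

```python
from fractions import Fraction
from itertools import product

def kurtosis_cumulant(values, probs):
    """Fourth-order cumulant (12.2.9) of a zero-mean variable given its pmf."""
    m2_abs = sum(p * abs(v) ** 2 for v, p in zip(values, probs))
    m2 = sum(p * v ** 2 for v, p in zip(values, probs))
    m4_abs = sum(p * abs(v) ** 4 for v, p in zip(values, probs))
    return m4_abs - 2 * m2_abs ** 2 - abs(m2) ** 2

# Assumed IID input: 2 with probability 1/3, -1 with probability 2/3 (zero mean).
w_values = [Fraction(2), Fraction(-1)]
w_probs = [Fraction(1, 3), Fraction(2, 3)]
# Assumed overall system c(0), c(1), c(2).
c = [Fraction(1), Fraction(1, 2), Fraction(-1, 4)]

# Exact pmf of y(n) = sum_k c(k) w(n - k), obtained by enumerating the inputs.
y_values, y_probs = [], []
for combo in product(range(len(w_values)), repeat=len(c)):
    y_values.append(sum(ck * w_values[i] for ck, i in zip(c, combo)))
    prob = Fraction(1)
    for i in combo:
        prob *= w_probs[i]
    y_probs.append(prob)

kappa_w = kurtosis_cumulant(w_values, w_probs)
kappa_y = kurtosis_cumulant(y_values, y_probs)
predicted = kappa_w * sum(abs(ck) ** 4 for ck in c)

print(kappa_w, kappa_y, predicted)
```

Exact rational arithmetic is used so that the two sides can be compared without a numerical tolerance.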

Proof. Relations (12.2.6) and (12.2.7) can be easily shown by using (12.2.5) and the
independence assumption. To prove (12.2.8), we start with (12.2.5); then by interchanging the order
between expectation and summations, we have

$$\begin{aligned}
E\{|y(n)|^4\} &= E\left\{\left|\sum_k c(k)w(n-k)\right|^4\right\} \\
&= \sum_{k_1}\sum_{k_2}\sum_{k_3}\sum_{k_4} c(k_1)c^*(k_2)c(k_3)c^*(k_4) \\
&\quad \times E\{w(n-k_1)w^*(n-k_2)w(n-k_3)w^*(n-k_4)\}
\end{aligned} \tag{12.2.10}$$

where

$$E\{w(n-k_1)w^*(n-k_2)w(n-k_3)w^*(n-k_4)\} = \begin{cases}
E\{|w(n)|^4\} & k_1 = k_2 = k_3 = k_4 \\
E^2\{|w(n)|^2\} & k_1 = k_2 \ne k_3 = k_4, \;\; k_1 = k_4 \ne k_2 = k_3 \\
|E\{w^2(n)\}|^2 & k_1 = k_3 \ne k_2 = k_4 \\
0 & \text{otherwise}
\end{cases} \tag{12.2.11}$$

by invoking the independence assumption. If we substitute (12.2.11) into (12.2.10), we obtain

$$\begin{aligned}
E\{|y(n)|^4\} = {} & E\{|w(n)|^4\} \sum_k |c(k)|^4 \\
& + 2E^2\{|w(n)|^2\} \left[\left(\sum_k |c(k)|^2\right)^2 - \sum_k |c(k)|^4\right] \\
& + |E\{w^2(n)\}|^2 \left[\left|\sum_k c^2(k)\right|^2 - \sum_k |c(k)|^4\right]
\end{aligned} \tag{12.2.12}$$

Finally, substituting (12.2.6) and (12.2.7) into (12.2.12) and rearranging the various terms, we
obtain (12.2.8).

We now use the previous theorem to derive necessary and sufficient conditions for
blind deconvolution (Shalvi and Weinstein 1990).


THEOREM 12.3. Consider the blind deconvolution model shown in Figure 12.5 where $c(n) =
g(n) * h(n)$. If $E\{|y(n)|^2\} = E\{|w(n)|^2\}$, then

1. $|\kappa_y^{(4)}| \le |\kappa_w^{(4)}|$, that is, the kurtosis of the output is less than or equal to the kurtosis of the
input.
2. $|\kappa_y^{(4)}| = |\kappa_w^{(4)}|$ if and only if $c(n) = e^{j\theta}\delta(n - n_0)$. Hence, if the kurtosis of the output is
equal to the kurtosis of the input, the inverse system is given by $H(z) = e^{j\theta}z^{-n_0}/G(z)$.


Proof. The proof can be easily obtained by using the inequality

$$\sum_k |c(k)|^4 \le \left(\sum_k |c(k)|^2\right)^2 \tag{12.2.13}$$

where equality holds if and only if $c(k)$ has at most one nonzero component. The condition
$E\{|y(n)|^2\} = E\{|w(n)|^2\}$ in conjunction with (12.2.6) implies that $\sum_k |c(k)|^2 = 1$. Therefore,
$\sum_k |c(k)|^4 \le 1$ and $|\kappa_y^{(4)}| \le |\kappa_w^{(4)}|$ due to (12.2.8). Clearly, if $\sum_k |c(k)|^2 = 1$, we can have
$\sum_k |c(k)|^4 = 1$ if and only if $c(n) = e^{j\theta}\delta(n - n_0)$.


This theorem shows that a necessary and sufficient condition for the correct recovery
of the inverse system $h(n)$, that is, for successful blind deconvolution, is that $E\{|y(n)|^2\} =
E\{|w(n)|^2\}$ and $|\kappa_y^{(4)}| = |\kappa_w^{(4)}|$. Therefore, we can determine $h(n)$ by solving the following
constrained optimization problem:

$$\max_{h(n)} |\kappa_y^{(4)}| \qquad \text{subject to} \qquad E\{|y(n)|^2\} = E\{|w(n)|^2\} \tag{12.2.14}$$

It has been shown that for FIR inverse filters of sufficient length, $|\kappa_y^{(4)}|$ has no spurious local
maxima over $E\{|y(n)|^2\} = E\{|w(n)|^2\}$, and therefore gradient search algorithms converge
to the correct solution regardless of initialization (Shalvi and Weinstein 1990). We should
stress that the IID property of the input $w(n)$ is a key requirement for blind deconvolution
methods to work.
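To see how the criterion in (12.2.14) ranks candidate inverse filters, it helps to evaluate $\sum_k |c(k)|^4 / (\sum_k |c(k)|^2)^2$, which by (12.2.6) and (12.2.8) equals $|\kappa_y^{(4)}| / |\kappa_w^{(4)}|$ once the output power has been scaled to match the input power. The sketch below is not a blind algorithm: the channel `g` is assumed known purely so that the criterion can be computed in closed form, and the channel, the candidate equalizers, and the helper names are all hypothetical. A blind method would instead estimate $\kappa_y^{(4)}$ from the output data.

```python
def convolve(g, h):
    """Linear convolution c(n) = g(n) * h(n) of two finite sequences."""
    c = [0.0] * (len(g) + len(h) - 1)
    for i, gi in enumerate(g):
        for j, hj in enumerate(h):
            c[i + j] += gi * hj
    return c

def kurtosis_ratio(c):
    """|kappa_y| / |kappa_w| after power normalization, from (12.2.6), (12.2.8)."""
    s2 = sum(abs(ck) ** 2 for ck in c)
    s4 = sum(abs(ck) ** 4 for ck in c)
    return s4 / s2 ** 2

g = [1.0, 0.5]   # assumed minimum-phase channel G(z) = 1 + 0.5 z^-1

# Candidate inverse filters: none, and truncations of 1/G(z) = sum (-0.5)^k z^-k.
candidates = {
    "no equalizer": [1.0],
    "2-tap truncated inverse": [(-0.5) ** k for k in range(2)],
    "8-tap truncated inverse": [(-0.5) ** k for k in range(8)],
}

ratios = {name: kurtosis_ratio(convolve(g, h)) for name, h in candidates.items()}
for name, value in ratios.items():
    print(f"{name}: {value:.6f}")
```

The ratio never exceeds 1 and approaches 1 as the overall response $c(n)$ approaches a single nonzero sample, in agreement with Theorem 12.3.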

By using the normalized cumulants $\tilde{\kappa}_y^{(4)} = \kappa_y^{(4)}/\sigma_y^4$, it has been shown for real signals
(Donoho 1981) that

$$\tilde{\kappa}_y^{(4)} = \tilde{\kappa}_w^{(4)}\, \frac{\sum_k |c(k)|^4}{\left[\sum_k |c(k)|^2\right]^2} \tag{12.2.15}$$

which implies that $|\tilde{\kappa}_y^{(4)}| \le |\tilde{\kappa}_w^{(4)}|$, a result attributed to Granger (1976). Furthermore,
Donoho (1981) showed that if $\tilde{\kappa}_w^{(4)} \ne 0$, then maximization of $|\tilde{\kappa}_y^{(4)}|$ provides a solution
to the blind deconvolution problem (Tugnait 1992). An elaborate discussion of cumulant
maximization criteria and algorithms for blind deconvolution is given in Cadzow (1996).
A review of various approaches for blind system identification and deconvolution is given
in Abed-Meraim et al. (1997). In the next section, we apply these results to the design of
adaptive filters for blind equalization.


12.3 UNSUPERVISED ADAPTIVE FILTERS—BLIND EQUALIZERS 
All the adaptive filters we have discussed so far require the availability of a desired response signal that is used to “supervise” their operation. By this we mean that, at each time instant, the adaptive filter compares its output with the desired response and uses this information to improve its performance. In this sense, the desired response serves as a training signal that provides the feedback needed by the filter to improve its performance. However, as we discussed in Sections 1.4.1 and 10.1, there are applications such as blind equalization and blind deconvolution in which the availability of a desired response signal is either impossible or impractical. In this section we discuss adaptive filters that circumvent this problem; that is, they can operate without a desired response signal. These filters are called unsupervised adaptive filters to signify the fact that they operate without “supervision.” Clearly, unsupervised adaptive filters need additional information to make up for the lack of a desired response signal. This information depends on the particular application and has a big influence on the design and performance of the adaptive algorithm. The most widely used unsupervised adaptive filters are application-specific and operate by exploiting (1) the higher-order statistics, (2) the cyclostationary statistics, or (3) some invariant property of the input signal. Most unsupervised adaptive filtering algorithms have been developed in the context of blind equalization, which provides the most important practical application of these filters.


FIGURE 12.6
Conventional channel equalizer with training and decision-directed modes of operation. (Block diagram: channel input, LTI channel h(n), additive noise v(n), equalizer filter, decision device, and an adaptive algorithm driven by the delayed training sequence in training mode or by the decisions in decision-directed mode.)


12.3.1 Blind Equalization 


Figure 12.6 shows the traditional approach to adaptive equalization.† When the adaptive equalizer starts its operation, the transmitter sends a known training sequence over the unknown channel. Since the training sequence can be used as a desired response signal, we can adjust the equalizer's coefficients by using the standard LMS or RLS algorithms. The LMS equalization algorithm with a training sequence is

$$\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu\,\mathbf{x}(n)e^*(n) \tag{12.3.1}$$

where

$$e(n) = a_{n-n_0} - \hat{y}(n) = a_{n-n_0} - \mathbf{c}^H(n-1)\mathbf{x}(n) \tag{12.3.2}$$

is the a priori error. If, at the end of the training period, the MSE $E\{|e(n)|^2\}$ is so small that $\hat{y}(n) \approx a_{n-n_0}$, then we can replace $a_{n-n_0}$ by the decision $\hat{a}_{n-n_0} \triangleq Q[\hat{y}(n)]$ and switch the equalizer to decision-directed mode. The resulting algorithm is

$$\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu\,\mathbf{x}(n)\{Q[\hat{y}(n)] - \hat{y}(n)\}^* \tag{12.3.3}$$

and its performance depends on how close $\mathbf{c}(0)$ is to the optimum setting $\mathbf{c}_o$ according to the MSE or the zero-forcing criterion. If $\mathbf{c}(0)$ is close to $\mathbf{c}_o$, the intersymbol interference (ISI) is significantly reduced (i.e., the eye is open), the decision device makes correct decisions with low probability of error, and the algorithm is likely to converge to $\mathbf{c}_o$. However, if $\mathbf{c}(0)$ is not close to $\mathbf{c}_o$, that is, when the eye is closed (which is when we need an equalizer), then the error surface can be multimodal and the decision-directed equalizer fails to converge or converges to a local minimum (Mazo 1980).

† This approach has been discussed in Sections 6.8 and 10.4.4.
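The two modes of operation can be sketched in a few lines of Python. This is a minimal real-valued illustration under stated assumptions, not the book's simulation: binary symbols, a hypothetical minimum-phase channel, no noise, zero decision delay, and arbitrarily chosen filter length and step size. For real signals the conjugate in (12.3.1) and (12.3.3) has no effect.

```python
import random

random.seed(0)                       # fixed seed, for repeatability only
channel = [1.0, 0.4]                 # hypothetical minimum-phase channel
num_taps, delay, mu = 5, 0, 0.01     # illustrative choices
n_train, n_dd = 2000, 1000

symbols = [random.choice((-1.0, 1.0)) for _ in range(n_train + n_dd)]
# Channel output x(n) = sum_k h(k) a(n-k), noise-free for simplicity.
x = [sum(h * symbols[n - k] for k, h in enumerate(channel) if n - k >= 0)
     for n in range(len(symbols))]

c = [0.0] * num_taps
train_err, dd_symbol_errors = [], 0
for n in range(num_taps, len(symbols)):
    x_vec = [x[n - k] for k in range(num_taps)]        # x(n), ..., x(n-4)
    y = sum(ck * xk for ck, xk in zip(c, x_vec))       # equalizer output
    if n < n_train:
        desired = symbols[n - delay]                   # training symbol
        train_err.append((desired - y) ** 2)
    else:
        desired = 1.0 if y >= 0 else -1.0              # decision Q[y]
        dd_symbol_errors += desired != symbols[n - delay]
    e = desired - y                                    # a priori error
    c = [ck + 2 * mu * xk * e for ck, xk in zip(c, x_vec)]

final_train_mse = sum(train_err[-200:]) / 200
print(final_train_mse, dd_symbol_errors)
```

Because training leaves the eye open in this setup, the decisions made after the switch are reliable and the decision-directed update keeps tracking the same solution.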

The training session should be repeated each time the channel response changes or after 
system breakdowns, which results in a reduction of the data throughput. However, there 
are digital communications applications in which the start-up and retraining of the adaptive 
equalizer have to be accomplished without a training sequence. Adaptive equalizers that op- 
erate without the aid of a training signal are known as blind equalizers, although the term un- 
supervised would be more appropriate. The need for blind equalization is enormous in digital 
point-to-multipoint and broadcast networks, such as high-definition and cable television. 
In all these applications, the transmitter should be able to send its content unaffected by the 
joining or withdrawal of client receivers or their need for training data (Treichler et al. 1998). 

Clearly, blind equalization is a special case of blind deconvolution with input from a 
finite alphabet. When we deal with blind equalization, we should recall the following facts: 


1. The second-order statistics of the output provide information about the magnitude re- 
sponse of an LTI channel. Therefore, mixed-phase channels cannot be identified using 
second-order statistics only. 

2. Mixed-phase LTI channels with IID Gaussian inputs cannot be identified from their output because all statistical information is contained in the second-order moments.

3. The inverse of a mixed-phase LTI channel is IIR and unstable. Hence, only an FIR causal approximation can be used for its equalization.

4. Channels with zeros on the unit circle cannot be equalized by using zero-forcing equalizers (Section 6.8).

5. Since $|H(e^{j\omega})|^2 = |H(e^{j\omega})e^{j\theta}|^2$ and for perfect equalization $H(z)C(z) = b_0 z^{-n_0}$, $b_0 \ne 0$, the channel can be identified up to a rotational factor and a constant time shift.

6. The structure of the finite symbol alphabet improves the detection process, which can be thought of as an unsupervised pattern classification problem (Fukunaga 1990).
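The first fact in this list is easy to see numerically. In the following Python sketch, the two impulse responses are illustrative assumptions chosen as time-reversed copies of each other: one is minimum-phase and the other maximum-phase, yet their autocorrelation sequences, and hence their magnitude responses, coincide.

```python
def autocorrelation(h):
    """Deterministic autocorrelation r(l) = sum_n h(n+l) h(n), l >= 0, real h."""
    n = len(h)
    return [sum(h[k + lag] * h[k] for k in range(n - lag)) for lag in range(n)]

h_min = [1.0, 0.5]   # zero at z = -0.5 (minimum phase)
h_max = [0.5, 1.0]   # zero at z = -2   (maximum phase)

print(autocorrelation(h_min))  # [1.25, 0.5]
print(autocorrelation(h_max))  # [1.25, 0.5]
```

Any method that observes only second-order statistics therefore cannot tell these two channels apart.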


All equalizers (blind or not blind) use the second-order statistics (autocorrelation or power spectrum) of the channel output to obtain information about the channel's magnitude response. However, blind equalizers need additional information to determine the phase response of the channel and to compensate for the absence of the desired response sequence.
Phase information can be obtained from the HOS or the second-order and higher-order 
cyclostationary moments of the channel output. The cyclostationarity property results from 
the modulation of the transmitted signal (Gardner 1991). 

The above types of information can be exploited, either individually or in combination, 
to obtain various blind equalization algorithms. The available blind equalization methods 
can be categorized into two groups: 


1. HOS-based methods. These can be further divided into two groups: 


a. Implicit HOS algorithms implicitly explore HOS by iteratively minimizing a non- 
MSE criterion, which does not require the desired response but reflects the amount 
of residual ISI in the received signal. 

b. Explicit HOS algorithms compute explicitly the block estimates of the power spec- 
trum to determine the magnitude response and block estimates of the trispectrum, 
to determine the phase response of the channel. 


2. Cyclostationary statistics-based methods, which exploit the second-order cyclostationary statistics of the received signal.


Since the number of samples required to estimate the mth-order moment, for a given level 
of bias and variance, increases almost exponentially with order m (Brillinger 1980), both 
implicit and explicit HOS-based methods have a slow rate of convergence. Indeed, since 
channel identification requires at least fourth-order moments, HOS-based algorithms require 
a large number, typically several thousand, of data samples (Ding 1994). 

Explicit HOS methods originated in geophysics to solve blind deconvolution problems 
with non-Gaussian inputs (Wiggins 1978; Donoho 1981; Godfrey and Rocca 1981). A 
complete discussion of the application of HOS techniques to blind equalization is given 
in Hatzinakos and Nikias (1991). Because HOS algorithms require a large number of data 
samples and have high computational complexity, they are not used in practice for blind 
equalization applications. In contrast to symbol rate blind equalizers that require the use of 
HOS, the input of fractionally spaced equalizers (which is sampled at a rate higher than the symbol rate) contains additional cyclostationarity-based second-order statistics (SOS) that can be
exploited to identify the channel. Since SOS requires fewer data samples for estimation, 
we can exploit cyclostationarity to obtain algorithms that converge faster than HOS-based 
algorithms. Furthermore, channel identification using cyclic SOS does not preclude inputs 
with Gaussian or nearly Gaussian statistics. More information about these methods can be 
found in Gardner (1991), Ding (1994), Tong et al. (1994a, b), and Moulines et al. (1995). 
We focus on implicit HOS methods because they are easy to implement and are widely used 
in practice. 


12.3.2 Symbol Rate Blind Equalizers 
The basic structure of a blind equalization system is shown in Figure 12.7. The key element is a scalar zero-memory nonlinear function $\psi$, which serves to generate a desired response signal $\psi[\hat{y}(n)]$ for the adaptive algorithm.


FIGURE 12.7
Basic elements of an adaptive blind equalization system. (Block diagram: channel input, additive noise v(n), equalizer filter, decision device, and adaptive algorithm.)


We wish to find the function $\psi$ that provides a good estimate of the desired response $a_n$. To this end, suppose that we have a good initial guess $c(n)$ of the equalizer coefficients. Then we assume that the convolution of the channel and equalizer impulse responses can be decomposed as

$$h(n) * c(n) = \delta(n) + h_{\mathrm{ISI}}(n) \tag{12.3.4}$$

where $h_{\mathrm{ISI}}(n)$ is the component creating the ISI. The output of the equalizer is

$$\hat{y}(n) = c(n) * x(n) = c(n) * [h(n) * a_n + v(n)] = a_n + h_{\mathrm{ISI}}(n) * a_n + c(n) * v(n) \triangleq a_n + \tilde{v}(n) \tag{12.3.5}$$

where $h_{\mathrm{ISI}}(n) * a_n$ is the residual ISI and $c(n) * v(n)$ is additive noise. By invoking the central limit theorem, we can show that the convolutional noise $\tilde{v}(n)$ can be modeled as white Gaussian noise (Godfrey and Rocca 1981; Haykin 1996). Since $a_n$ is IID and since $a_n$ and $\tilde{v}(n)$ are statistically independent, the minimum MSE estimate $z(n)$ of $a_n$ based on $\hat{y}(n)$ is

$$z(n) = E\{a_n \mid \hat{y}(n)\} \triangleq \psi[\hat{y}(n)] \tag{12.3.6}$$

which is a nonlinear function of $\hat{y}(n)$ because $a_n$ has a non-Gaussian distribution. Then the a priori error is

$$e(n) = \psi[\hat{y}(n)] - \hat{y}(n) \tag{12.3.7}$$

where

$$\hat{y}(n) = \sum_{k=-L}^{L} c_k^*(n-1)\,x(n-k) \triangleq \mathbf{c}^H(n-1)\mathbf{x}(n) \tag{12.3.8}$$

is the output of the equalizer. This leads to the following a priori stochastic gradient algorithm for blind equalization

$$\mathbf{c}(n) = \mathbf{c}(n-1) + \mu\,\mathbf{x}(n)e^*(n) \tag{12.3.9}$$

where $\mu$ is the adaptation step size.
Another approach used to derive (12.3.9) is to start with the cost function

$$P(n) \triangleq E\{\Psi[\hat{y}(n)]\} \tag{12.3.10}$$

where

$$\tilde{\psi}(y) \triangleq \psi(y) - y \tag{12.3.11}$$

is the derivative

$$\tilde{\psi}(y) \triangleq \Psi'(y) = \frac{\partial \Psi(y)}{\partial y} \tag{12.3.12}$$

of a nonlinear function $\Psi$. The nonlinearity of $\Psi$ creates the dependence of the cost function on the HOS of $\hat{y}(n)$ and $a_n$. The cost function (12.3.10) should not require the input sequence $a_n$; it should reflect the amount of current ISI, and its minimum should correspond to the minimum ISI or minimum MSE condition. In contrast to the MSE criterion, which depends on the SOS and is a quadratic (convex) function of the equalizer parameters, the cost function (12.3.10) is nonconvex and may have local minima. If we compute the gradient of $P(n)$ with respect to $\mathbf{c}$ and drop the expectation operation, we obtain the stochastic gradient algorithm (12.3.9).

Equations (12.3.8), (12.3.7), and (12.3.9) provide the general form of LMS-type blind equalization algorithms. Different choices for the nonlinear function $\psi$ result in various algorithms for blind equalization. Because the output $\hat{y}(n)$ is approximately a Bussgang process, these algorithms are sometimes called Bussgang algorithms for blind equalization (Haykin 1996). A process is called Bussgang (Bussgang 1952; Bellini 1986) if it satisfies the property

$$E\{\hat{y}(n)\hat{y}^*(n-l)\} = E\{\hat{y}(n)\,\psi[\hat{y}^*(n-l)]\} \tag{12.3.13}$$

that is, its autocorrelation is equal to the cross-correlation between the process and a nonlinear transformation of the process.


Sato algorithm. The first blind equalizer was introduced by Sato (1975) for one-dimensional multilevel pulse amplitude modulation (PAM) signals. It uses the error function

$$\tilde{\psi}_1(n) = R_1\,\mathrm{sgn}[\hat{y}(n)] - \hat{y}(n) = e(n) \tag{12.3.14}$$

where

$$R_1 \triangleq \frac{E\{|a_n|^2\}}{E\{|a_n|\}} \tag{12.3.15}$$

and $\mathrm{sgn}(x)$ is the signum function. Integration of $\tilde{\psi}_1(n)$ gives

$$\Psi_1[\hat{y}(n)] = \tfrac{1}{2}\big[R_1 - |\hat{y}(n)|\big]^2 \tag{12.3.16}$$

whose expectation provides the cost function for the Sato algorithm. The complex version of the algorithm, used for quadrature amplitude modulation (QAM) constellations, uses the error

$$e(n) = R_1\,\mathrm{csgn}[\hat{y}(n)] - \hat{y}(n) \tag{12.3.17}$$

where

$$\mathrm{csgn}(x) = \mathrm{csgn}(x_r + jx_i) = \mathrm{sgn}(x_r) + j\,\mathrm{sgn}(x_i) \tag{12.3.18}$$

is the complex signum function.
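The Sato quantities are simple to compute. The sketch below evaluates the constant $R_1$ of (12.3.15) and the complex error of (12.3.17); the function names and the four-level PAM alphabet are illustrative assumptions, and the symbols are taken to be equiprobable.

```python
def sato_constant(alphabet):
    """R1 = E{|a|^2} / E{|a|} for equiprobable symbols in `alphabet`."""
    return (sum(abs(a) ** 2 for a in alphabet)
            / sum(abs(a) for a in alphabet))

def csgn(z):
    """Complex signum: sgn(Re z) + j sgn(Im z)."""
    def sgn(v):
        return (v > 0) - (v < 0)
    return complex(sgn(z.real), sgn(z.imag))

def sato_error(y, r1):
    """Sato error e(n) = R1 csgn[y(n)] - y(n) for an equalizer output y."""
    return r1 * csgn(y) - y

pam4 = [-3, -1, 1, 3]
r1 = sato_constant(pam4)                      # 20 / 8 = 2.5
print(r1, sato_error(complex(2.5, 0.0), r1))  # the error vanishes at |y| = R1
```

Note that the error is zero at $\pm R_1$ rather than at the actual symbol levels, which is why the Sato error stays noisy even after convergence for multilevel constellations.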


Godard algorithms. The most widely used algorithms in practical blind equalization applications were developed by Godard (1980) for QAM signal constellations. Godard replaced the function $\Psi_1$ with the more general function

$$\Psi_p[\hat{y}(n)] = \frac{1}{2p}\big[R_p - |\hat{y}(n)|^p\big]^2 \tag{12.3.19}$$

where $p$ is a positive integer and $R_p$ is the positive real constant

$$R_p \triangleq \frac{E\{|a_n|^{2p}\}}{E\{|a_n|^p\}} \tag{12.3.20}$$

which is known as the dispersion of order $p$. The family of Godard stochastic gradient algorithms is described by

$$\mathbf{c}(n) = \mathbf{c}(n-1) + \mu\,\mathbf{x}(n)e^*(n) \tag{12.3.21}$$

where

$$e(n) = \hat{y}(n)|\hat{y}(n)|^{p-2}\big[R_p - |\hat{y}(n)|^p\big] \tag{12.3.22}$$

is the error signal. This is an LMS-type algorithm obtained by computing the gradient of (12.3.19) and dropping the expectation operator.
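To make the dispersion constant and the Godard error concrete, the following sketch evaluates them for a 16-QAM alphabet with coordinates in {-3, -1, 1, 3}. The alphabet, the function names, and the assumption of equiprobable symbols are illustrative choices.

```python
def dispersion(alphabet, p):
    """R_p = E{|a|^(2p)} / E{|a|^p} for equiprobable symbols."""
    return (sum(abs(a) ** (2 * p) for a in alphabet)
            / sum(abs(a) ** p for a in alphabet))

def godard_error(y, r_p, p):
    """e(n) = y |y|^(p-2) [R_p - |y|^p]; for p = 1 this requires y != 0."""
    mag = abs(y)
    return y * mag ** (p - 2) * (r_p - mag ** p)

# 16-QAM with in-phase and quadrature levels in {-3, -1, 1, 3}.
qam16 = [complex(i, q) for i in (-3, -1, 1, 3) for q in (-3, -1, 1, 3)]
r2 = dispersion(qam16, 2)
print(r2)                                  # close to 13.2
print(godard_error(complex(3, 3), r2, 2))  # corner point: nonzero error
```

For a constant-modulus alphabet such as QPSK the error vanishes at every symbol; for 16-QAM it does not, so the update keeps fluctuating around the solution even when the ISI has been removed.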

Other algorithms for blind equalization include (Ding 1998) the extensions of the Sato 
algorithm in Benveniste et al. (1980), the stop-and-go algorithms (Picchi and Prati 1987), 
and the Shalvi and Weinstein algorithms (Shalvi and Weinstein 1990). 


12.3.3 Constant-Modulus Algorithm 


The Godard algorithm for $p = 2$ was independently introduced by Treichler and Agee (1983) under the name constant-modulus algorithm (CMA), using the property restoral approach. The resulting cost function

$$P(n) = E\{[R_2 - |\hat{y}(n)|^2]^2\} \tag{12.3.23}$$

depends on the amount of ISI plus noise at the output of the equalizer. Godard (1980) has shown that the coefficient values that minimize (12.3.23) are close to the values that minimize† the MSE $E\{[|a_n|^2 - |\hat{y}(n)|^2]^2\}$. The criterion is independent of the carrier phase because if we replace $\hat{y}(n)$ by $\hat{y}(n)e^{j\phi}$ in (12.3.23), then $P(n)$ remains unchanged. As a result, the adaptation of the CMA can take place independently of and simultaneously with operation of the carrier recovery system. The CMA is summarized in Table 12.2. Note that for 128-QAM, $R_2 = 110$. If we choose $R_2 \ne 110$, the CMA converges to a linearly scaled 128-QAM constellation that satisfies (12.3.23). However, choosing an unreasonable value for $R_2$ may cause problems when we switch to decision-directed mode (Gitlin et al. 1992).

† More precisely, we wish to minimize $E\{[|a_{n-n_0}|^2 - |\hat{y}(n)|^2]^2\}$ for a particular choice of the delay $n_0$. As we have seen in Section 6.8, the value of $n_0$ has a critical effect on the performance of the equalizer.


TABLE 12.2
Summary of Godard or constant-modulus algorithm.

Operation          Equation
Equalizer          $\hat{y}(n) = \sum_{k=-L}^{L} c_k^*(n-1)\,x(n-k)$
Error              $e(n) = \hat{y}(n)[R_2 - |\hat{y}(n)|^2]$
Updating           $\mathbf{c}(n) = \mathbf{c}(n-1) + \mu\,\mathbf{x}(n)e^*(n)$
Godard constant    $R_2 \triangleq E\{|a(n)|^4\}/E\{|a(n)|^2\}$

Because of its practical success and its computational simplicity, the CMA is widely 
used in blind equalization and blind array signal processing systems. 

The CMA in Table 12.2 performs a stochastic gradient minimization of the constant-modulus performance surface (12.3.23). In contrast to the unimodal MSE performance surface of trained equalizers, the constant-modulus performance surface of blind equalizers is multimodal. The multimodality of the error surface and the lack of a desired response signal have profound effects on the convergence properties of the CMA (Johnson et al. 1998). A detailed analysis of the local convergence of the CMA is provided in Ding et al. (1991).


1. Initialization. Since the CMA error surface is nonconvex, the algorithm may converge 
to undesirable minima, which indicates the importance of the initialization procedure. 
In practice, almost all blind equalizers are initialized using the tap-centering approach: 
All coefficients are set to zero except for the center (reference) coefficient, which is set 
larger than a certain constant. 

2. Convergence rate. The trained LMS algorithm has a bounded convergence rate $(1 - 2\mu\lambda_{\max})^{-1} < \tau < (1 - 2\mu\lambda_{\min})^{-1}$ because the Hessian matrix (which determines the curvature) of the quadratic error surface is constant. Since the error surface of the constant-modulus criterion is multimodal and includes saddle points, the convergence rate of the CMA is slow in the neighborhood of saddle points and comparable to that of the trained LMS in the neighborhood of a local minimum.

3. Excess MSE. In the trained LMS algorithm, the excess MSE is determined by the step 
size, attainable MMSE, number of filter coefficients, and power of the input signal. 
In addition, the excess MSE of the CMA depends on the kurtosis of the source signal 
(Fijalkow et al. 1998). 


EXAMPLE 12.3.1. To illustrate the key characteristics of the adaptive blind symbol or baud-spaced equalizer (BSE) using the CMA, we used BERGULATOR, a public-domain interactive MATLAB-5 program that allows experimentation with the constant-modulus criterion and various implementations of the CMA (Schniter 1998). The system function of the channel is $H(z) = 1 + 0.5z^{-1}$; the input is an IID sequence with four equispaced levels (PAM); the SNR = 50 dB; the equalizer has two coefficients $c_0$ and $c_1$; and the step size of the CMA is $\mu = 0.005$. Figure 12.8 shows contours of the constant-modulus criterion surface in the equalizer coefficient space, where the location of the MMSE is indicated by the asterisk (*) and the local MSE locations by ×. Since the constant-modulus surface is multimodal, the equalizer converges to a different minimum depending on the initial starting point. This is illustrated by the two different coefficient trajectories shown in Figure 12.8, which demonstrates the importance of initialization in adaptive algorithms with nonquadratic cost functions. Figure 12.9 shows the learning curves for smoothed versions of the error and the square of the error for the trajectories in Figure 12.8.
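A baud-spaced CMA run of this kind can be sketched in plain Python. The sketch below does not reproduce the BERGULATOR experiment: it keeps the channel $H(z) = 1 + 0.5z^{-1}$ and the step size 0.005, but it assumes binary symbols, a noise-free channel, seven equalizer coefficients with tap-centering initialization, and a simple residual-ISI measure, all of which are illustrative choices.

```python
import random

random.seed(1)                        # fixed seed, for repeatability only
channel = [1.0, 0.5]                  # H(z) = 1 + 0.5 z^-1
num_taps, mu, n_iter = 7, 0.005, 5000
r2 = 1.0                              # E{a^4}/E{a^2} for symbols in {-1, +1}

def residual_isi(c):
    """(sum|h| - max|h|) / max|h| for the combined response h = channel * c."""
    h = [0.0] * (len(channel) + len(c) - 1)
    for i, hi in enumerate(channel):
        for j, cj in enumerate(c):
            h[i + j] += hi * cj
    peak = max(abs(v) for v in h)
    return (sum(abs(v) for v in h) - peak) / peak

symbols = [random.choice((-1.0, 1.0)) for _ in range(n_iter)]
x = [sum(h * symbols[n - k] for k, h in enumerate(channel) if n - k >= 0)
     for n in range(n_iter)]

c = [0.0] * num_taps
c[num_taps // 2] = 1.0                # tap-centering initialization
isi_before = residual_isi(c)

for n in range(num_taps, n_iter):
    x_vec = [x[n - k] for k in range(num_taps)]
    y = sum(ck * xk for ck, xk in zip(c, x_vec))
    e = y * (r2 - y * y)              # CMA error (real-valued case)
    c = [ck + mu * xk * e for ck, xk in zip(c, x_vec)]

isi_after = residual_isi(c)
print(isi_before, isi_after)
```

The update never uses the transmitted symbols; only the constant $R_2$ carries information about the source. Changing the position of the initial nonzero tap changes which minimum, and hence which overall delay, the equalizer settles on.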


FIGURE 12.8
Contours of the constant-modulus cost function and coefficient trajectories for a blind BSE using the CMA. (Plot title: CMA error surface; the plot also marks the MSE ellipse axes.)


FIGURE 12.9
Learning curves for a blind BSE using the CMA for the two coefficient trajectories in Figure 12.8.


12.4 FRACTIONALLY SPACED EQUALIZERS


The input to a fractionally spaced equalizer (FSE) (see Figure 12.10) is obtained by sampling the channel output at a rate faster than the symbol or baud rate $R_B = 1/T_B$, where $T_B$ is the symbol duration. For simplicity and because they are extensively used in practice, we focus on $T_B/2$ spaced FSEs. However, all results can be extended to any rational fraction of $T_B$. One of the most attractive features of an FSE is that under ideal conditions, a finite impulse response (FIR) FSE can perfectly equalize an FIR channel (Johnson et al. 1998).

FIGURE 12.10
Block diagram of data communications receiver with a fractionally spaced equalizer. (The output of the receiving filter is sampled at t = nT.)

Referring† to Figure 6.26(a), we see that the continuous-time output of the channel is

$$\tilde{x}(t) = \sum_{k=-\infty}^{\infty} a_k \tilde{h}_r(t - kT_B - t_0) + \tilde{v}(t) \tag{12.4.1}$$

where $\tilde{h}_r(t)$ is the continuous-time impulse response and where we have incorporated the channel delay $t_0$ in $\tilde{h}_r(t)$. The discrete-time model of Figure 6.30 is no longer valid since $T = T_B/2$. However, if we extend the development leading to Figure 6.30 for $t = nT_B/2$, we obtain the discrete-time signal

$$x(n) = \sum_{k=0}^{\infty} a_k h_r(n - 2k) + v(n) \tag{12.4.2}$$

where $h_r(n)$ is the equivalent discrete-time impulse response and $v(n)$ is the equivalent white Gaussian noise (WGN).

† The material in this section requires familiarity with the notation and concepts developed in Section 6.8.

The output of an FIR $T_B/2$ spaced FSE is

$$y_f(n) = \sum_{k=0}^{2M-1} c_k x(n-k) \tag{12.4.3}$$

where we have chosen the even-order $2M$ for simplicity. If we decimate the output of the equalizer by retaining the odd-indexed samples $2n + 1$, we have

$$\hat{y}(n) \triangleq y_f(2n+1) = \sum_{k=0}^{2M-1} c_k x(2n+1-k) = \sum_{k=0}^{M-1} c_{2k}\,x(2n+1-2k) + \sum_{k=0}^{M-1} c_{2k+1}\,x(2n-2k)$$

or

$$\hat{y}(n) = \sum_{k=0}^{M-1} c_k^e x^o(n-k) + \sum_{k=0}^{M-1} c_k^o x^e(n-k) \tag{12.4.4}$$

where

$$c_k^e \triangleq c_{2k} \qquad c_k^o \triangleq c_{2k+1} \qquad x^e(n) \triangleq x(2n) \qquad x^o(n) \triangleq x(2n+1) \tag{12.4.5}$$

are known as the even (e) and odd (o) parts of the equalizer impulse responses and the received sequences, respectively. Equation (12.4.4) expresses the decimated symbol rate output of the equalizer as the sum of two symbol rate convolutions involving the even and odd two-channel subequalizers.
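The equivalence between the decimated output and the two symbol-rate convolutions can be checked directly. In the Python sketch below, the coefficient values, the test sequence, and the function names are arbitrary illustrative assumptions; only the index bookkeeping matters.

```python
def fse_output_direct(c, x, n):
    """y_hat(n) = sum_{k=0}^{2M-1} c_k x(2n + 1 - k), from the T/2-rate data."""
    return sum(ck * x[2 * n + 1 - k] for k, ck in enumerate(c))

def fse_output_polyphase(c, x, n):
    """Same output written as two symbol-rate convolutions."""
    c_even, c_odd = c[0::2], c[1::2]          # c_k^e = c_2k, c_k^o = c_2k+1
    x_even, x_odd = x[0::2], x[1::2]          # x^e(n) = x(2n), x^o(n) = x(2n+1)
    m = len(c_even)
    return (sum(c_even[k] * x_odd[n - k] for k in range(m))
            + sum(c_odd[k] * x_even[n - k] for k in range(m)))

c = [0.5, -0.25, 1.0, 0.75]                   # 2M = 4 illustrative taps
x = [float(i * i % 7) for i in range(20)]     # arbitrary T/2-spaced samples
n = 5
print(fse_output_direct(c, x, n), fse_output_polyphase(c, x, n))
```

Note the cross pairing: the even-indexed coefficients multiply the odd-indexed received samples and vice versa, because the retained output samples are the odd-indexed ones.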

If we define the even and odd symbol rate subchannels

$$h^e(n) \triangleq h_r(2n) \qquad \text{and} \qquad h^o(n) \triangleq h_r(2n+1) \tag{12.4.6}$$

we can show that the combined impulse response $h(n)$ from the transmitted symbols $a_n$ to the symbol rate output $\hat{y}(n)$ of the FSE is given by

$$h(n) = c_n^e * h^o(n) + c_n^o * h^e(n) \tag{12.4.7}$$

in the time domain or

$$H(z) = C^e(z)H^o(z) + C^o(z)H^e(z) \tag{12.4.8}$$

in the z domain. The resulting two-channel system model is illustrated in Figure 12.11.


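FIGURE 12.11
Two-channel representation of a $T_B/2$ spaced equalizer.


12.4.1 Zero-Forcing Fractionally Spaced Equalizers


If we define the even subchannel matrix

$$\mathbf{H}_e \triangleq \begin{bmatrix} h^e(0) & 0 & \cdots & 0 \\ h^e(1) & h^e(0) & \ddots & \vdots \\ \vdots & h^e(1) & \ddots & 0 \\ h^e(L-1) & \vdots & \ddots & h^e(0) \\ 0 & h^e(L-1) & \ddots & h^e(1) \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & h^e(L-1) \end{bmatrix} \tag{12.4.9}$$

the even subequalizer vector

$$\mathbf{c}_e \triangleq [c_0^e \; c_1^e \; \cdots \; c_{M-1}^e]^T \tag{12.4.10}$$

and their counterparts $\mathbf{H}_o$ and $\mathbf{c}_o$, we can express the convolution equation (12.4.7) in matrix form as

$$\mathbf{h} = \mathbf{H}\mathbf{c} \tag{12.4.11}$$

where

$$\mathbf{H} \triangleq [\mathbf{H}_e \; \mathbf{H}_o] \qquad \mathbf{c} \triangleq \begin{bmatrix} \mathbf{c}_o \\ \mathbf{c}_e \end{bmatrix} \tag{12.4.12}$$

and $\mathbf{h} \triangleq [h(0) \; h(1) \; \cdots \; h(M+L-1)]^T$ is the symbol-spaced overall system response. In the absence of noise, the system is free of ISI if $\mathbf{h}$ is equal to

$$\boldsymbol{\delta}_{n_0} \triangleq [0 \; \cdots \; 0 \; 1 \; 0 \; \cdots \; 0]^T \tag{12.4.13}$$

where $n_0$, $0 \le n_0 \le M+L-1$, indicates the location of the nonzero coefficient. Equivalently, the z domain zero-ISI condition from (12.4.8) is given by

$$z^{-n_0} = H(z) = C^e(z)H^o(z) + C^o(z)H^e(z) \tag{12.4.14}$$

The zero-forcing FIR equalizer is specified by the system of linear equations $\mathbf{H}\mathbf{c} = \boldsymbol{\delta}_{n_0}$, which has a solution if $\mathbf{H}$ is full row rank. This condition is also known as strong perfect equalization, and it holds if the number of columns is equal to or larger than the number of rows, that is, if $2M \ge M+L-1$ or $M \ge L-1$. Furthermore, the $T_B/2$ spaced full-rank condition implies that the system functions $H_e(z)$ and $H_o(z)$ have no common roots. These topics are discussed in detail in Johnson et al. (1998).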

The main advantage of the zero-forcing FSE over the corresponding synchronous equalizer is that, in the absence of noise, complete elimination of ISI is possible with a finite-order FSE. For the synchronous equalizer, complete ISI elimination is possible only when the equalizer is of infinite length.
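The finite-length zero-forcing solution can be illustrated with a small numerical example. The sketch below is written under stated assumptions: a hypothetical six-tap $T_B/2$-spaced channel (so each subchannel has $L = 3$ taps), $M = L - 1 = 2$ coefficients per subequalizer so that the stacked matrix is square, and a stacking in which the even-subchannel block multiplies the odd subequalizer, as the cross pairing in (12.4.7) requires. The helper names `solve` and `conv_matrix` are illustrative.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[r][j] -= f * aug[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def conv_matrix(h, m):
    """(len(h)+m-1) x m Toeplitz matrix whose columns are shifted copies of h."""
    rows = len(h) + m - 1
    return [[h[i - j] if 0 <= i - j < len(h) else 0.0 for j in range(m)]
            for i in range(rows)]

h_r = [1.0, 0.5, 0.3, -0.2, 0.1, 0.05]     # hypothetical T/2-spaced channel
h_e, h_o = h_r[0::2], h_r[1::2]            # even and odd subchannels
M = len(h_e) - 1                           # M = L - 1 gives a square H
H = [re + ro for re, ro in zip(conv_matrix(h_e, M), conv_matrix(h_o, M))]
n0 = 1                                     # chosen overall delay
delta = [1.0 if i == n0 else 0.0 for i in range(len(H))]
c = solve(H, delta)                        # stacked as [c_o; c_e]
h_total = [sum(hij * cj for hij, cj in zip(row, c)) for row in H]
print(h_total)                             # approximately a unit sample at n0
```

With only four coefficients in total, the overall symbol-rate response is a pure delay to within rounding error, which a baud-spaced FIR equalizer of any finite length cannot achieve for this kind of channel. The solve step fails if the two subchannels share a root, in which case the matrix is singular.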


12.4.2 MMSE Fractionally Spaced Equalizers 


When the channel noise $v(n)$ is present, perfect equalization is not possible, even for an FSE. Hence, the emphasis shifts to the best possible compromise between ISI and noise amplification (which is present in a zero-forcing equalizer) in a minimum MSE sense. This is obtained by minimizing the mean square value of the data symbol error

$$e(n) = \hat{y}(n) - a_{n-n_0} \tag{12.4.15}$$

for a particular choice of delay $n_0$. To obtain an expression for $\hat{y}(n)$ using the vector $\mathbf{h}$ in (12.4.11), we first define

$$\mathbf{a}_n \triangleq [a_n \; a_{n-1} \; \cdots \; a_{n-(M+L-1)}]^T \tag{12.4.16}$$

and

$$\mathbf{v}(n) \triangleq [v(n-1) \; v(n-3) \; \cdots \; v(n-2L+1) \; v(n) \; v(n-2) \; \cdots \; v(n-2L+2)]^T \tag{12.4.17}$$

where the samples of the noise sequence are arranged as odd samples followed by the even samples so as to be consistent with the definitions of $\mathbf{H}$ and $\mathbf{c}$. We then substitute (12.4.2) into (12.4.4) and obtain

$$\hat{y}(n) = \mathbf{a}_n^T\mathbf{H}\mathbf{c} + \mathbf{v}^T(n)\mathbf{c} \tag{12.4.18}$$

Using $\boldsymbol{\delta}_{n_0}$ in (12.4.13), we see that the desired symbol $a_{n-n_0}$ is equal to $\mathbf{a}_n^T\boldsymbol{\delta}_{n_0}$. Hence from (12.4.15) and (12.4.18), the symbol error is

$$e(n) = \mathbf{a}_n^T(\mathbf{H}\mathbf{c} - \boldsymbol{\delta}_{n_0}) + \mathbf{v}^T(n)\mathbf{c} \tag{12.4.19}$$


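Assuming that the symbol sequence $\{a_n\}$ is IID with variance $\sigma_a^2$ and is uncorrelated with the noise sequence $v(n) \sim \mathrm{WN}(0, \sigma_v^2)$, the mean square value of the error $e(n)$ is given by

$$\mathrm{MSE}(\mathbf{c}, n_0) = E\{|e(n)|^2\} = \sigma_a^2(\mathbf{H}\mathbf{c} - \boldsymbol{\delta}_{n_0})^H(\mathbf{H}\mathbf{c} - \boldsymbol{\delta}_{n_0}) + \sigma_v^2\,\mathbf{c}^H\mathbf{c} \tag{12.4.20}$$

which is a function of two minimizing parameters $\mathbf{c}$ and $n_0$. Following our development in Section 6.2 on linear MSE estimation, the equalizer coefficient vector that minimizes (12.4.20) is given by

$$\hat{\mathbf{c}} = \Big(\mathbf{H}^H\mathbf{H} + \frac{\sigma_v^2}{\sigma_a^2}\mathbf{I}\Big)^{-1}\mathbf{H}^H\boldsymbol{\delta}_{n_0} \tag{12.4.21}$$

which is the classical Wiener filter. Also compare (12.4.21) with the frequency-domain Wiener filter given in (6.8.29). The corresponding minimum MSE with respect to $\mathbf{c}$ is given by

$$\min_{\mathbf{c}} \mathrm{MSE}(\mathbf{c}, n_0) = \mathrm{MSE}(n_0) = \sigma_a^2\,\boldsymbol{\delta}_{n_0}^H\Big[\mathbf{I} - \mathbf{H}\Big(\mathbf{H}^H\mathbf{H} + \frac{\sigma_v^2}{\sigma_a^2}\mathbf{I}\Big)^{-1}\mathbf{H}^H\Big]\boldsymbol{\delta}_{n_0} \tag{12.4.22}$$

Finally, the optimum value of $n_0$ is obtained by determining the index of the minimum diagonal element of the matrix in square brackets in (12.4.22), that is,

$$\hat{n}_0 = \arg\min_{n_0}\Big\{\Big[\mathbf{I} - \mathbf{H}\Big(\mathbf{H}^H\mathbf{H} + \frac{\sigma_v^2}{\sigma_a^2}\mathbf{I}\Big)^{-1}\mathbf{H}^H\Big]_{n_0,n_0}\Big\} \tag{12.4.23}$$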


Once again, similar to the synchronous equalizer, the MMSE fractionally spaced equal- 
izer is more robust to both the channel noise and the large amount of ISI. Additionally, it 
provides insensitivity to sampling phase and an ability to function as a matched filter in 
the presence of severe noise. Therefore, in practice, FSEs are preferred to synchronous 
equalizers. 
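The MMSE design and the search over the delay can be sketched as follows for real-valued data. The channel taps, the variances, the number of coefficients per subequalizer, and the helper names are illustrative assumptions. For each delay the sketch solves the regularized normal equations, evaluates the MSE directly, and keeps the best delay.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[r][j] -= f * aug[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def conv_matrix(h, m):
    """(len(h)+m-1) x m Toeplitz matrix whose columns are shifted copies of h."""
    return [[h[i - j] if 0 <= i - j < len(h) else 0.0 for j in range(m)]
            for i in range(len(h) + m - 1)]

def mmse_equalizer(H, n0, rho):
    """Solve (H^T H + rho I) c = H^T delta_n0, with rho = var_v / var_a."""
    cols = len(H[0])
    gram = [[sum(row[i] * row[j] for row in H) + (rho if i == j else 0.0)
             for j in range(cols)] for i in range(cols)]
    return solve(gram, [H[n0][i] for i in range(cols)])  # H^T delta = row n0

def mse(H, c, n0, var_a, var_v):
    """var_a ||H c - delta_n0||^2 + var_v ||c||^2 for real-valued data."""
    hc = [sum(hij * cj for hij, cj in zip(row, c)) for row in H]
    isi = sum((v - (1.0 if i == n0 else 0.0)) ** 2 for i, v in enumerate(hc))
    return var_a * isi + var_v * sum(cj * cj for cj in c)

h_r = [1.0, 0.5, 0.3, -0.2, 0.1, 0.05]     # hypothetical T/2-spaced channel
M, var_a, var_v = 3, 1.0, 0.01             # illustrative values
H = [re + ro for re, ro in zip(conv_matrix(h_r[0::2], M),
                               conv_matrix(h_r[1::2], M))]
results = []
for n0 in range(len(H)):
    c = mmse_equalizer(H, n0, var_v / var_a)
    results.append((mse(H, c, n0, var_a, var_v), n0, c))
best_mse, best_n0, best_c = min(results)
print(best_n0, best_mse)
```

At the optimum, the directly evaluated MSE agrees with the closed form $\sigma_a^2[1 - (\mathbf{H}\hat{\mathbf{c}})_{n_0}]$, which is the $n_0$th diagonal element used in the delay search.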


12.4.3 Blind Fractionally Spaced Equalizers 


Fractionally spaced equalizers have just about dominated practical equalization applications 
because they are insensitive to sampling phase, they can function as matched filters, they can 
compensate severe band-edge delay distortion, they provide reduced noise enhancement, 
and they can perfectly equalize an FIR channel under ideal conditions (Gitlin et al. 1992; 
Johnson et al. 1998). 

The CMA for an FSE is given by

$$\hat{y}(n) = \sum_{k=0}^{M-1} c_k^o(n-1)\,x^e(n-k) + \sum_{k=0}^{M-1} c_k^e(n-1)\,x^o(n-k) \triangleq \mathbf{c}^T(n-1)\mathbf{x}(n) \tag{12.4.24}$$

where $\mathbf{c}(n-1)$ and $\mathbf{x}(n)$ are concatenated even and odd sample vectors, with the error signal and the coefficient update formed as in Table 12.2. The blind FSE adaptive structure is shown in Figure 12.12. The value of $R_2$ depends on the input symbol constellation. This algorithm and its convergence are discussed in Johnson et al. (1998).

FIGURE 12.12
Basic elements of an FS adaptive blind equalization system. (Block diagram: received signal x(t), fractionally spaced equalizer, decimation by 2, and decision device.)


EXAMPLE 12.4.1. To illustrate the superiority of the blind FSE over the blind BSE, we have used the BERGULATOR to simulate a 16-QAM data transmission system. The channel system function is $H(z) = 0.2 + 0.5z^{-1} + z^{-2} - 0.1z^{-3}$, the SNR = 20 dB, and the equalizer has $M = 8$ coefficients. Figure 12.13 shows the constellation of the received signal at the input of the equalizer, where it is clear that the combined effect of ISI and noise makes detection extremely difficult, if not impossible. Figures 12.14 and 12.15 show the symbol constellations at the output of a BSE and an FSE, respectively. We can easily see that the FSE is able to significantly reduce ISI. Figure 12.16 shows the learning curves for the blind adaptive FSE using the CMA.

FIGURE 12.13
Constellation of the received signal symbols at the input of the equalizer. (Plot title: Constellation diagram: received data.)

FIGURE 12.14
Constellation of the equalized signal symbols at the output of the BSE equalizer. (Plot title: Constellation diagram, last 20% of output data.)

FIGURE 12.15
Constellation of the equalized signal symbols at the output of the FSE. (Plot title: Constellation diagram, last 20% of output data.)

FIGURE 12.16
Learning curves for the blind FSE adaptive equalizer using the CMA.


12.5 FRACTIONAL POLE-ZERO SIGNAL MODELS 


In this section we show how to obtain models with hyperbolically decaying autocorrelation, 
and hence long memory, by introducing fractional poles at zero frequency (fractional pole 
models) or nonzero frequency (harmonic fractional pole models). Cascading fractional with 
rational models results in mixed-memory models, known as fractional pole-zero models. We 
explore the properties of both types of models and introduce techniques for their practical 
implementation. Special emphasis is placed on the generation of discrete fractional pole 
noise, which is the random process generated by exciting a fractional pole model with 
white Gaussian noise. We conclude with a brief introduction to pole-zero and fractional 
pole models with SαS IID inputs, which result in processes with high variability and short
or long memory, respectively. Fractional models are widely used in areas such as hydrology, 
data network traffic analysis, heart rate analysis, and economics. 


12.5.1 Fractional Unit-Pole Model 


The impulse response and the autocorrelation sequence of a pole-zero model decay exponentially with time, that is, they are geometrically bounded as

$$|h(n)| \le C_h\zeta^{n} \qquad |\rho(l)| \le C_\rho\zeta^{|l|} \tag{12.5.1}$$

where $C_h, C_\rho > 0$ and $0 < \zeta < 1$ (see Chapter 4). To get a long impulse response or a long autocorrelation, at least one of the poles should move very close to the unit circle. However, in many applications we need models whose autocorrelation decays more slowly than $\zeta^{l}$ as $l \to \infty$, that is, models with long memory (see Section 3.2.4). In this section, we introduce a class of models, known as fractional pole models, whose autocorrelation asymptotically exhibits a hyperbolic decay.

We have seen in Chapter 4 that by restricting some “integral” poles to being on the unit 
circle, we obtain models that are useful in modeling some types of nonstationary behavior. 
The fractional pole model FP(d) was introduced in Granger and Joyeux (1980) and Hosking 
(1981), and is defined by 


$$H_d(z) = \sum_{k=0}^{\infty} h_d(k)z^{-k} \triangleq \frac{1}{(1 - z^{-1})^d} \tag{12.5.2}$$

where $d$ is a nonintegral, that is, a fractional parameter. See Figure 12.17.


FIGURE 12.17
Block diagram representation of the discrete-time fractional noise model. (White noise w(n) drives the model, whose output is the discrete-time fractional noise x(n).)


The characteristics of the model depend on the value of parameter d. Since d is not 
an integer, $H_d(z)$ is not a rational function. It is the nonrationality that gives this model its
long-memory properties. Although we can approximate a fractional model by a PZ(P, Q) 
model, the orders P and Q that are needed to obtain a good approximation can be very large. 
This makes the estimation of pole-zero model parameters very difficult, and in practice it 
is better to use an FP(d) model. 


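Impulse response. To obtain the impulse response of the fractional pole model, we
expand the system function $H_d(z) = (1 - z^{-1})^{-d}$ in a power series using the binomial
series expansion. This gives

$$H_d(z) = \frac{1}{(1 - z^{-1})^{d}} = 1 + d\,z^{-1} + \frac{d(d+1)}{2!}\,z^{-2} + \cdots \tag{12.5.3}$$

The impulse response is given by

$$h_d(n) = \frac{d(d+1)\cdots(d+n-1)}{n!} = \frac{(d+n-1)!}{n!\,(d-1)!} = \frac{\Gamma(n+d)}{\Gamma(n+1)\,\Gamma(d)} \tag{12.5.4}$$

for $n \ge 0$ and $h_d(n) = 0$ for $n < 0$. $\Gamma(\cdot)$ is the gamma function defined as

$$\Gamma(\alpha) = \begin{cases} \displaystyle\int_0^{\infty} t^{\alpha-1} e^{-t}\,dt & \alpha > 0 \\[1ex] \alpha^{-1}\,\Gamma(1+\alpha) & \alpha < 0 \end{cases} \tag{12.5.5}$$

with $\Gamma(\alpha + 1) = \alpha\Gamma(\alpha)$ for any $\alpha$ and $\Gamma(n + 1) = n!$ for n an integer. Note that $h_d(n)$ can
be easily computed by using the recursion

$$h_d(n) = \frac{d+n-1}{n}\,h_d(n-1) \qquad n = 1, 2, \ldots \tag{12.5.6}$$

with $h_d(0) = 1$.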
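
The recursion (12.5.6) is easy to try numerically. The following Python sketch is only an illustration: the function names are hypothetical, and the closed-form helper assumes d > 0 so that the log-gamma of d is well defined. It generates the impulse response recursively and compares it with (12.5.4).

```python
import math

def fp_impulse_response(d, n_max):
    """Return [h_d(0), ..., h_d(n_max)] using the recursion (12.5.6)."""
    h = [1.0]                                  # h_d(0) = 1
    for n in range(1, n_max + 1):
        h.append(h[-1] * (d + n - 1) / n)      # h_d(n) = (d+n-1)/n * h_d(n-1)
    return h

def fp_impulse_response_closed_form(d, n):
    """Closed form (12.5.4) via log-gamma; this helper assumes d > 0."""
    return math.exp(math.lgamma(n + d) - math.lgamma(n + 1) - math.lgamma(d))

if __name__ == "__main__":
    d = 0.3
    h = fp_impulse_response(d, 1000)
    print(h[:4])
    # Ratio of h_d(n) to n^(d-1)/Gamma(d) at n = 1000 (a power-law check).
    print(h[1000] * math.gamma(d) / 1000 ** (d - 1))
```

Working with the recursion avoids evaluating large factorials directly, which is why it is usually preferred over the closed form for long responses.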
The system function of the inverse model is

$$H_I(z) = \sum_{n=0}^{\infty} h_I(n)\, z^{-n} = (1 - z^{-1})^{d} \tag{12.5.7}$$

Hence

$$h_I(n) = \frac{(n-d-1)!}{n!\,(-d-1)!} = \frac{\Gamma(n-d)}{\Gamma(n+1)\,\Gamma(-d)} \tag{12.5.8}$$

As expected, $h_I(n)$ is obtained from $h_d(n)$ by simply replacing d by −d.


Minimum-phase. To understand the behavior of the model, we look at the impulse
response as $n \to \infty$. Using Stirling's approximation (Abramowitz and Stegun 1970)

$$\frac{(n+d-1)!}{n!} \sim n^{d-1} \qquad \text{as } n \to \infty \tag{12.5.9}$$

we have

$$h_d(n) \sim \frac{1}{(d-1)!}\, n^{d-1} \qquad \text{as } n \to \infty \tag{12.5.10}$$

As a result of this hyperbolic decay, the sum $\sum_{n=0}^{\infty} |h_d(n)|$ does not exist for d > 0. Therefore,
the system is not BIBO stable. However, if $d < \frac{1}{2}$, then $\sum_{n=0}^{\infty} h_d^2(n) < \infty$; if, in addition, the
input w(n) has finite variance, then the output of the system

$$x(n) = \sum_{k=0}^{\infty} h_d(k)\, w(n-k) \tag{12.5.11}$$

exists in the mean square sense. In a similar way, the output of the inverse system exists in
mean square if $d > -\frac{1}{2}$. In view of this mean square convergence, we say that the fractional
pole model is minimum-phase if $-\frac{1}{2} < d < \frac{1}{2}$ even if $h_d(n)$ does not converge absolutely.


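Spectrum. The complex power spectrum of the model is $R_x(z) = \sigma_w^2 R_h(z)$, where

$$R_h(z) = H_d(z)\,H_d(z^{-1}) = \frac{1}{(1 - z^{-1})^{d}\,(1 - z)^{d}} \tag{12.5.12}$$

For $z = e^{j\omega}$ we obtain the power spectrum

$$R_h(e^{j\omega}) = \frac{1}{[2\sin(\omega/2)]^{2d}} \qquad -\pi < \omega \le \pi \tag{12.5.13}$$

We see that $R_h(e^{j0}) = \sum_{l=-\infty}^{\infty} r_h(l)$ is finite only if d < 0. Also as the frequency $\omega \to 0$, the
power spectrum becomes

$$R_h(e^{j\omega}) \sim \frac{1}{|\omega|^{2d}} \qquad \text{as } \omega \to 0 \tag{12.5.14}$$

because $\sin\theta \approx \theta$ as $\theta \to 0$.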
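
The low-frequency behavior in (12.5.14) can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical, and the absolute value inside the power is an added assumption so that negative frequencies are handled. It evaluates (12.5.13) and its power-law approximation.

```python
import math

def fp_power_spectrum(omega, d):
    """R_h(e^{j omega}) = |2 sin(omega/2)|^(-2d), following (12.5.13)."""
    return abs(2.0 * math.sin(omega / 2.0)) ** (-2.0 * d)

def fp_low_frequency_approx(omega, d):
    """Power-law approximation |omega|^(-2d) from (12.5.14)."""
    return abs(omega) ** (-2.0 * d)

if __name__ == "__main__":
    for d in (0.3, -0.3):
        for omega in (1e-1, 1e-2, 1e-3):
            exact = fp_power_spectrum(omega, d)
            approx = fp_low_frequency_approx(omega, d)
            print(d, omega, exact, approx)
```

For d > 0 the values grow without bound as the frequency approaches zero, whereas for d < 0 they shrink toward zero, which matches the low-pass and high-pass interpretations of the two parameter ranges.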


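Autocorrelation. The autocorrelation $r_x(l) = \sigma_w^2 r_h(l)$ of the model can be found by
using the inverse Fourier transform of $R_h(e^{j\omega})$, that is,

$$r_h(l) = \frac{1}{2\pi}\int_{-\pi}^{\pi} R_h(e^{j\omega})\, e^{j\omega l}\, d\omega = \frac{1}{2\pi}\int_{0}^{2\pi} (\cos \omega l)\left(2\sin\frac{\omega}{2}\right)^{-2d} d\omega \tag{12.5.15}$$

Using the identity (Gradshteyn and Ryzhik 1994)

$$\int_0^{\pi} \cos(ax)\,\sin^{\nu-1}x\; dx = \frac{\pi\cos(a\pi/2)\,\Gamma(\nu+1)\,2^{1-\nu}}{\nu\,\Gamma[(\nu+a+1)/2]\,\Gamma[(\nu-a+1)/2]}$$

we obtain

$$r_h(l) = \frac{(-1)^l\,\Gamma(1-2d)}{\Gamma(1+l-d)\,\Gamma(1-l-d)} \qquad l = 0, \pm 1, \pm 2, \ldots \tag{12.5.16}$$

for the autocorrelation and

$$\rho_h(l) = \frac{r_h(l)}{r_h(0)} = \frac{\Gamma(1-d)\,\Gamma(l+d)}{\Gamma(d)\,\Gamma(l+1-d)} = \frac{(d+l-1)!\,(-d)!}{(d-1)!\,(l-d)!} \tag{12.5.17}$$

for the normalized autocorrelation. Using Stirling's formula, we obtain the following
asymptotic approximation

$$\rho_h(l) \sim C_\rho\, l^{2d-1} \qquad \text{as } l \to \infty \tag{12.5.18}$$

which again verifies the long memory of the model. From (12.5.16) and the definition of
power spectrum, we have

$$r_h(0) = \sum_{n=0}^{\infty} h^2(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} |H(e^{j\omega})|^2\, d\omega = \frac{\Gamma(1-2d)}{\Gamma^2(1-d)} \tag{12.5.19}$$

Thus, for $d < \frac{1}{2}$ we have $\int_{-\pi}^{\pi} |H(e^{j\omega})|^2\, d\omega < \infty$. Hence, the inverse transform h(n)
converges in mean square.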
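
As a numerical illustration of (12.5.17) through (12.5.19), the following Python sketch uses hypothetical function names. The one-step recursion for the normalized autocorrelation is not stated in the text; it is obtained here by taking the ratio of consecutive terms of (12.5.17). The sketch also compares a partial sum of the squared impulse response with the closed-form value of the zero-lag autocorrelation.

```python
import math

def fp_normalized_autocorr(d, l_max):
    """rho_h(0..l_max) from (12.5.17), using rho(l) = rho(l-1) * (l-1+d)/(l-d)."""
    rho = [1.0]
    for l in range(1, l_max + 1):
        rho.append(rho[-1] * (l - 1 + d) / (l - d))
    return rho

def fp_zero_lag_autocorr(d):
    """r_h(0) = Gamma(1-2d) / Gamma(1-d)^2 from (12.5.19); requires d < 1/2."""
    return math.gamma(1 - 2 * d) / math.gamma(1 - d) ** 2

def fp_energy_partial_sum(d, n_terms):
    """Partial sum of h_d(n)^2, with h_d(n) generated by the recursion (12.5.6)."""
    h, total = 1.0, 1.0
    for n in range(1, n_terms):
        h *= (d + n - 1) / n
        total += h * h
    return total

if __name__ == "__main__":
    d = 0.3
    rho = fp_normalized_autocorr(d, 5000)
    print(rho[1], d / (1 - d))                       # lag-1 value
    print(rho[5000] * 5000 ** (1 - 2 * d))           # approaches a constant
    print(fp_energy_partial_sum(-0.3, 20000), fp_zero_lag_autocorr(-0.3))
```

The partial sum converges quickly for negative d and slowly for d close to 1/2, which is one practical symptom of long memory.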


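Partial autocorrelation. To determine the partial autocorrelation sequence, we can
show, using (12.5.17) and the algorithm of Levinson-Durbin, that the AP(m) model parameters
are given by

$$a_k^{(m)} = \binom{m}{k}\,\frac{(k-d-1)!\,(m-d-k)!}{(-d-1)!\,(m-d)!} \tag{12.5.20}$$

Therefore, since $k_m = a_m^{(m)}$, setting k = m in (12.5.20) gives

$$k_m = -\frac{d}{m-d} \qquad m = 1, 2, 3, \ldots \tag{12.5.21}$$

Equivalently, the partial autocorrelation in the usual statistical sign convention is
$-k_m = d/(m-d)$, which is positive for d > 0 (Hosking 1981).

The details of the derivation are the subject of Problem 12.6.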
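
The closed-form partial autocorrelation can be checked numerically by running a Durbin-Levinson recursion on the normalized autocorrelation (12.5.17). The Python sketch below uses hypothetical function names and works in the statistical sign convention, in which the one-step predictor is a sum of +phi coefficients times past samples; in a convention where the predictor carries a minus sign, the coefficients change sign. With this convention the lag-m partial autocorrelation should come out as d/(m − d).

```python
def fp_normalized_autocorr(d, l_max):
    """rho(0..l_max) of the FP(d) model via rho(l) = rho(l-1) * (l-1+d)/(l-d)."""
    rho = [1.0]
    for l in range(1, l_max + 1):
        rho.append(rho[-1] * (l - 1 + d) / (l - d))
    return rho

def durbin_levinson_pacf(rho, m_max):
    """Partial autocorrelations phi_mm, m = 1..m_max, from rho[0..m_max]."""
    phi_prev = []          # phi_prev[j] holds phi_{m-1, j+1}
    pacf = []
    for m in range(1, m_max + 1):
        num = rho[m] - sum(phi_prev[j] * rho[m - 1 - j] for j in range(m - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(m - 1))
        phi_mm = num / den
        # Order update: phi_{m,j} = phi_{m-1,j} - phi_mm * phi_{m-1,m-j}
        phi_prev = [phi_prev[j] - phi_mm * phi_prev[m - 2 - j]
                    for j in range(m - 1)] + [phi_mm]
        pacf.append(phi_mm)
    return pacf

if __name__ == "__main__":
    d = 0.3
    pacf = durbin_levinson_pacf(fp_normalized_autocorr(d, 10), 10)
    for m, value in enumerate(pacf, start=1):
        print(m, value, d / (m - d))
```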


Model memory. From Equations (12.5.10), (12.5.18), and (12.5.14) and from the long-memory
definitions in Section 3.4, we conclude that the minimum-phase fractional pole
model has long memory. More specifically, we arrive at the following conclusions:

• Long memory. For $0 < d < \frac{1}{2}$ the autocorrelation and partial correlation sequences
decay monotonically and hyperbolically to zero. Although $\sum_{l=-\infty}^{\infty} |\rho(l)| = \infty$ and
$R(e^{j\omega}) \to \infty$ as $\omega \to 0$, the integral (12.5.19) of $R(e^{j\omega})$ is finite. The spectrum is
dominated by low-frequency components (low-pass), and the divergence at $\omega = 0$ causes
the long-memory behavior. The system acts as a fractional integrator.

• Short memory. For $-\frac{1}{2} < d < 0$ the autocorrelation and partial autocorrelation sequences
decay monotonically and hyperbolically to zero. In this case $\sum_{l=-\infty}^{\infty} |\rho(l)| < \infty$,
$R(e^{j0}) = \sum_{l=-\infty}^{\infty} \rho(l) = 0$, and the spectrum is dominated by high-frequency components
(high-pass). Sometimes we say that this model exhibits short-memory behavior.
The system acts as a fractional differentiator.


Figures 12.18 and 12.19 show the impulse response, autocorrelation, partial autocorrelation, 
and power spectrum of the FP(d) model for various values of d. The short-memory and 
long-memory behavior of the model, as a function of parameter d, is clearly evident. 


Discrete-time fractional pole noise. If we drive an FP(d) model with white Gaussian
noise (see Figure 12.20), the resulting process is known as discrete-time fractional Gaussian
noise (DTFGN). Since the impulse response of an FP(d) system decays hyperbolically,
its system function cannot be accurately approximated by a rational function. Hence, its
practical implementation is not straightforward. Short sequences can be generated using the
$LDL^H$ or Cholesky decomposition of the process correlation matrix formed from (12.5.16),
as explained in Section 3.5. This approach guarantees that the correlation of the generated
sequence matches the theoretical autocorrelation. Since the correlation matrix is Toeplitz,
its triangular factors can be computed efficiently by using the Schur algorithm (see Section 7.7).
Careful inspection of Figure 12.19 shows that the impulse response of the inverse
system decays extremely rapidly. Therefore, we can obtain a very accurate recursive
implementation of the FP(d) system by following the approach discussed in Example 4.5.1.

FIGURE 12.18
Impulse response, autocorrelation, partial autocorrelation, and power spectrum of the FP(d)
model for d = 0.1, 0.2, 0.3, 0.4, 0.49.

A practical algorithm for the generation of DTFGN is derived in Hosking (1984) using
the following result: For any stationary process with zero mean value, the conditional mean
and variance of x(n) given $\{x(j)\}_{0}^{n-1}$ are given by

$$\mu_x(n) = E\{x(n) \mid x(n-1), \ldots, x(0)\} = -\sum_{j=1}^{n} a_j^{(n)}\, x(n-j) \triangleq -\mathbf{a}_n^T \mathbf{x}_n \tag{12.5.22}$$

and

$$v_x(n) = \mathrm{Var}\{x(n) \mid x(n-1), \ldots, x(0)\} = \sigma_x^2 \prod_{j=1}^{n} \left(1 - |k_j|^2\right) \tag{12.5.23}$$

where $\mathbf{a}_n$ is the forward linear predictor (FLP) with lattice parameters $k_j$ and $\sigma_x^2 =
E\{|x(n)|^2\}$ (Ramsey 1974). This result implies that we can use the Levinson-Durbin
algorithm to recursively determine $\mu_x(n)$ and $v_x(n)$, starting at n = 0 and generating
$x(n) \sim \mathrm{WGN}[\mu_x(n), v_x(n)]$ at each step. For the FP(d) model this algorithm is simplified
because $k_m$ is known in closed form from (12.5.21). The algorithm is initialized with $x(0) \sim \mathrm{WGN}(0, \sigma_x^2)$
and continues by repeating the following recursions


(12.5.24) [equation illegible in the source]

$$\mathbf{a}_{n+1} = \begin{bmatrix} \mathbf{a}_n \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{J}\mathbf{a}_n \\ 1 \end{bmatrix} k_n \tag{12.5.25}$$

$$\mu_x(n+1) = -\mathbf{a}_{n+1}^T\, \mathbf{x}_{n+1} \tag{12.5.26}$$

$$v_x(n+1) = v_x(n)\left(1 - |k_n|^2\right) \tag{12.5.27}$$

$$x(n+1) \sim \mathrm{WGN}[\mu_x(n+1),\, v_x(n+1)] \tag{12.5.28}$$

for n = 1, 2, ..., N. The algorithm is implemented by the function x = dtfgn(d,N).

FIGURE 12.19
Impulse response, autocorrelation, partial autocorrelation, and power spectrum of the FP(d)
model for d = −0.1, −0.2, −0.3, −0.4, −0.49.


Figure 12.20 shows sample realizations of discrete fractional noise, for various values of 
d, generated by using the above algorithm. A simplified, numerically robust algorithm, 
using the lattice structure, is introduced in Problem 12.9. The estimation of long memory 
is discussed in Section 12.6. 
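
The conditional mean and variance recursion can be sketched in a few lines of Python. This is a minimal illustration in the spirit of the Hosking (1984) approach described above, not a reproduction of the book's dtfgn function: the name dtfgn_sketch, the rng argument, and the unit driving-noise variance default are assumptions made here. The sketch uses the statistical sign convention, in which the conditional mean is a sum of +phi coefficients times past samples and the lag-n partial autocorrelation is d/(n − d).

```python
import math
import random

def dtfgn_sketch(d, n_samples, sigma_w=1.0, rng=None):
    """Generate n_samples of discrete-time fractional Gaussian noise, |d| < 1/2.

    Returns (x, v), where v[n] is the conditional variance used to draw x[n].
    """
    rng = rng or random.Random()
    # Process variance sigma_x^2 = sigma_w^2 * Gamma(1-2d) / Gamma(1-d)^2, see (12.5.19).
    var = sigma_w ** 2 * math.gamma(1 - 2 * d) / math.gamma(1 - d) ** 2
    x = [rng.gauss(0.0, math.sqrt(var))]
    v = [var]
    phi = []                                   # phi[j-1] multiplies x(n-j)
    for n in range(1, n_samples):
        k = d / (n - d)                        # partial autocorrelation at lag n
        # Levinson order update of the predictor coefficients.
        phi = [phi[j] - k * phi[n - 2 - j] for j in range(n - 1)] + [k]
        var *= (1.0 - k * k)                   # conditional variance update
        mean = sum(phi[j] * x[n - 1 - j] for j in range(n))
        x.append(rng.gauss(mean, math.sqrt(var)))
        v.append(var)
    return x, v

if __name__ == "__main__":
    samples, cond_var = dtfgn_sketch(0.3, 500, rng=random.Random(1))
    print(samples[:5])
    print(cond_var[0], cond_var[-1])
```

The cost grows quadratically with the number of samples, so this form is practical only for moderately long records. As the record grows, the conditional variance settles toward the driving-noise variance.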


12.5.2 Fractional Pole-Zero Models: FPZ(P, d, Q) 


Since the behavior of the FP(d) model is controlled by the single parameter d, it is not
flexible enough to model the wide variety of short-term (small-lag) autocorrelation structures
encountered in practical applications. A more powerful model capable of modeling both
short-term and long-term correlation structures can be obtained by cascading a PZ(P, Q)
model (to handle short memory) with an FP(d) (to handle long memory). This can be viewed
as filtering discrete-time fractional noise with a pole-zero filter. The resulting model is known
as the fractional pole-zero model and is denoted by FPZ(P, d, Q). The system function is

$$H_{\mathrm{fpz}}(z) = \frac{1}{(1 - z^{-1})^{d}}\,\frac{D(z)}{A(z)} \tag{12.5.29}$$

The FPZ(P, d, Q) is minimum-phase if $-\frac{1}{2} < d < \frac{1}{2}$ and PZ(P, Q) is minimum-phase.

With regard to the long-range behavior of the model, we can show that as $l \to \infty$,

$$\rho(l) \sim C_\rho\, l^{2d-1} \tag{12.5.30}$$

where $C_\rho \neq 0$, and as $\omega \to 0$

$$R(e^{j\omega}) = \frac{1}{|1 - e^{-j\omega}|^{2d}}\,\frac{|D(e^{j\omega})|^2}{|A(e^{j\omega})|^2} \simeq \frac{|D(e^{j0})|^2}{|A(e^{j0})|^2}\,\frac{1}{|\omega|^{2d}} \tag{12.5.31}$$

Parameter d controls the impulse response and the autocorrelation of the model at large
lags and the spectrum at low frequencies. Parameters $a_k$ and $d_k$ control the impulse response
and the autocorrelation of the model at small lags, and the spectrum at high frequencies.

FIGURE 12.20
Sample realizations of discrete-time fractional Gaussian noise for two different values of d.


Autoregressive fractionally integrated moving-average models. Fractional pole-zero
models driven by white noise [autoregressive fractionally integrated moving-average
(ARFIMA) models] generate random signals whose samples are significantly dependent
even if they are far apart. In practice (e.g., geophysics, hydrology, economics) there
are many time series in which the dependence between samples that are far apart,
though small, is still too significant to be ignored. Such signals with long-term persistence
can be effectively modeled using ARFIMA models, because of their flexibility in dealing
with both short-term and long-term correlation structures. An alternative family of random
fractal models for modeling long memory behavior is discussed in the next section.


Harmonic fractional pole-zero models. The FP(d) models with $0 < d < \frac{1}{2}$ exhibit
long memory, but their spectrum peaks at zero frequency and their autocorrelation does
not have any periodicity. We next discuss a class of harmonic models with long memory,
periodic autocorrelations, and power spectra that resonate at any frequency in the interval
$0 \le \omega \le \pi$. Such models are more appropriate for the modeling of data with strong
periodicities because they exhibit long memory and pseudoperiodic behavior.

Let $e^{\pm j\theta}$ be a pair of complex conjugate poles on the unit circle and at angles $\pm\theta$ from
the real axis. Then we have $(1 - e^{j\theta}z^{-1})(1 - e^{-j\theta}z^{-1}) = 1 - (2\cos\theta)z^{-1} + z^{-2}$. The
harmonic fractional pole model, denoted by HFP(d, θ), is a causal system defined by

$$H_{\theta,d}(z) = \frac{1}{(1 - 2z^{-1}\cos\theta + z^{-2})^{d}} = \sum_{n=0}^{\infty} h_{\theta,d}(n)\, z^{-n} \tag{12.5.32}$$

where d is a fractional parameter and θ is an angle controlling the location of the peak of the
spectrum. For θ = 0, Equation (12.5.32) reduces to a standard FP(2d) model. The properties
of this model are discussed in Problem 12.10. The minimum-phase HFP(d, θ) model can
be cascaded with a minimum-phase PZ(P, Q) model to obtain an HFPZ(P, d, Q, θ) model
that offers greater flexibility in controlling both the short-term and long-term correlation
structure.
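
The impulse response in (12.5.32) can be computed with a three-term recursion. The Python sketch below rests on an assumption brought in from standard special-function theory rather than from the text: the power-series coefficients of $(1 - 2xz + z^2)^{-d}$ are the Gegenbauer polynomials $C_n^{(d)}(x)$, which satisfy $n\,C_n = 2x(n+d-1)\,C_{n-1} - (n+2d-2)\,C_{n-2}$. The function name is hypothetical.

```python
import math

def hfp_impulse_response(d, theta, n_max):
    """Return [h(0), ..., h(n_max)] of the HFP(d, theta) model (12.5.32).

    Uses the Gegenbauer recurrence with x = cos(theta).
    """
    x = math.cos(theta)
    h = [1.0, 2.0 * d * x]                     # C_0 = 1, C_1 = 2 d x
    for n in range(2, n_max + 1):
        h.append((2.0 * x * (n + d - 1) * h[n - 1]
                  - (n + 2 * d - 2) * h[n - 2]) / n)
    return h[:n_max + 1]

if __name__ == "__main__":
    # theta = 0 should reproduce the FP(2d) impulse response.
    print(hfp_impulse_response(0.2, 0.0, 5))
    # A nonzero theta gives an oscillating, slowly decaying response.
    print(hfp_impulse_response(0.2, math.pi / 4, 8))
```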


12.5.3 Symmetric α-Stable Fractional Pole-Zero Processes


Up to this point we have studied linear signal models driven by a sequence of IID Gaussian
or non-Gaussian random variables with finite variance. However, many practical time series
including isolated sharp spikes or bursts of spikes can be better described by random signal
models with infinite variance. To ensure that some signal samples take large values with
high probability, we need a probability density function with fat or heavy tails. We focus
on the family of SαS random variables because of their heavy tails and the fact that they
are invariant under linear transformations.

As we have seen in Chapters 4 and 5, the linear process

$$x(n) = \sum_{k=0}^{\infty} h(k)\, w(n-k) \tag{12.5.33}$$

is strictly stationary if (1) $w(n) \sim \mathrm{IID}(0, \sigma_w^2)$ with $\sigma_w^2 < \infty$ (finite variance) and (2)
$\sum_{k=0}^{\infty} |h(k)| < \infty$, that is, the system is BIBO stable. However, to ensure stationarity
when the input is SαS with $\sigma_w = \infty$ (power law tails), the sequence |h(k)| should decay
exponentially. Since the impulse response of a stable pole-zero system decays exponentially,
its response to an SαS IID sequence is strictly stationary and SαS stable.

So far, we have discussed the properties of fractional pole-zero models and their
response to white noise with finite variance. The following theorem specifies under what
conditions the output of an FPZ(0, d, 0) model with stable excitation is defined.


THEOREM 12.4. Consider the following fractional pole model FP(d)

$$x(n) = \sum_{k=0}^{\infty} h(k)\, w(n-k) \qquad \text{with} \qquad h(k) = \frac{(d+k-1)!}{k!\,(d-1)!} \tag{12.5.34}$$

where w(n) is IID and SαS. A necessary condition for the series (12.5.34) to converge is

$$-\infty < d < 1 - \frac{1}{\alpha} \tag{12.5.35}$$

When (12.5.35) holds, the series converges in the following sense:

1. $0 < \alpha \le 1$: absolutely almost surely
2. $1 < \alpha < 2$: absolutely almost surely if d < 0 and almost surely if d > 0 and μ = 0

Proof. See Samorodnitsky and Taqqu (1994).


We note that because both h(n) and the tails of the input distribution decay as a power
law, the stability of the model depends on α. Recall that no dependence on the input signal
exists if the input signal has finite variance, because for $E\{w^2(n)\} < \infty$ the stability
requirement is $\sum_{k=0}^{\infty} |h(k)|^2 < \infty$.

The output of the inverse model $g(n) = h(n)|_{d \to -d}$ is defined for $-\infty < -d < 1 - 1/\alpha$
or $-(1 - 1/\alpha) < d < \infty$; hence, the model is minimum-phase if

$$-\left(1 - \frac{1}{\alpha}\right) < d < 1 - \frac{1}{\alpha} \tag{12.5.36}$$

The stability and minimum-phase regions for the FP(d) model with SαS IID excitations are
shown in Figure 12.21. Theorem 12.4 applies for the model FPZ(P, d, Q) assuming it is
stable, because it behaves asymptotically as the FPZ(0, d, 0) model.
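
The two regions described by (12.5.35) and (12.5.36) can be expressed as simple predicates. The Python sketch below is only an illustration with hypothetical function names; it treats the necessary condition (12.5.35) as the test for the output being defined and restricts α to (0, 2].

```python
def _check_alpha(alpha):
    """Characteristic exponent of an SaS law must lie in (0, 2]."""
    if not 0.0 < alpha <= 2.0:
        raise ValueError("alpha must satisfy 0 < alpha <= 2")

def fp_sas_output_defined(d, alpha):
    """Condition (12.5.35): d < 1 - 1/alpha."""
    _check_alpha(alpha)
    return d < 1.0 - 1.0 / alpha

def fp_sas_minimum_phase(d, alpha):
    """Condition (12.5.36): -(1 - 1/alpha) < d < 1 - 1/alpha."""
    _check_alpha(alpha)
    bound = 1.0 - 1.0 / alpha
    return -bound < d < bound

if __name__ == "__main__":
    for alpha in (2.0, 1.5, 1.0):
        print(alpha, fp_sas_output_defined(0.3, alpha), fp_sas_minimum_phase(0.3, alpha))
```

For α = 2 the minimum-phase region reduces to |d| < 1/2, the finite-variance case, and it shrinks to the empty set as α decreases to 1.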


FIGURE 12.21
Stability (left) and minimum-phase (right) regions for a fractional pole model driven by an
SαS IID sequence.


Although a linear stable process is strictly stationary, it is not second-order stationary
because $E\{|x(n)|^2\} = \infty$. Therefore, the autocorrelation and the PSD of the process x(n) do
not exist. However, we can use the normalized autocorrelation of the signal model (12.5.33)

$$\rho(l) = \frac{\displaystyle\sum_{n=0}^{\infty} h(n)\,h(n-l)}{\displaystyle\sum_{n=0}^{\infty} h^2(n)} \tag{12.5.37}$$

and its Fourier transform to characterize the linear stable process x(n). Clearly, this is a
legitimate characterization for processes with finite variance and provides a reasonable
characterization for stable linear processes because of the IID nature of the excitation w(n).
We can estimate ρ(l) from a set of data $\{x(n)\}_{0}^{N-1}$ using the consistent estimator (Brockwell
and Davis 1991)

$$\hat{\rho}(l) = \frac{\displaystyle\sum_{n=0}^{N-1-|l|} x(n)\,x(n+|l|)}{\displaystyle\sum_{n=0}^{N-1} x^2(n)} \tag{12.5.38}$$
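
The estimator (12.5.38) translates directly into code. The following Python sketch, with a hypothetical function name, computes the normalized sample autocorrelation for nonnegative lags without removing the mean, which matches the form of the estimator as written here.

```python
def normalized_autocorr_estimate(x, max_lag):
    """Return [rho_hat(0), ..., rho_hat(max_lag)] following (12.5.38)."""
    n = len(x)
    if n == 0 or max_lag >= n:
        raise ValueError("need a nonempty record and max_lag < len(x)")
    energy = sum(v * v for v in x)             # denominator: sum of x(n)^2
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
            for lag in range(max_lag + 1)]

if __name__ == "__main__":
    print(normalized_autocorr_estimate([1.0, 2.0, 3.0, 4.0], 3))
```

Because the denominator is the sample energy rather than a variance, the ratio stays well defined for heavy-tailed data even though the individual sums may be dominated by a few large samples.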


12.6 SELF-SIMILAR RANDOM SIGNAL MODELS 


In this section, we introduce the family of statistically self-similar or random fractal models, 
which are based on self-similar stochastic processes. Any segment of a self-similar process 
looks similar, in a statistical sense, to a scaled version of a larger segment of the process. 
Because of their practical importance, we focus on self-similar processes with stationary 
increments. We show that the stationary-increments requirement leads to processes whose 
autocorrelation sequences decay hyperbolically, that is, to models with long memory. We 
mainly focus on the fractional Brownian motion (nonstationary) and the fractional Gaussian 
noise (stationary) models, as well as their properties, simulation, and applications. We
also provide a brief introduction to self-similar processes with SαS increments, which result
in random signal models with long memory and high variability.


12.6.1 Self-Similar Stochastic Processes 


Each time a geologist takes a photograph of a geological object, say, a fossil, she or he 
includes in the picture an object with known scale (e.g., a coin or a ruler), because without 
the scale, it is impossible to determine whether the photograph covers 10 cm or 10 m. For 
this reason we say that geological phenomena are scale-invariant, or that they do not have 
a characteristic scale. 

If we can reproduce an object by magnifying some portion of it, we say that the object 
is scale-invariant, or self-similar. Thus, self-similarity is invariance with respect to scaling. 
Such self-similar geometric objects are known as fractals (Mandelbrot 1982). 

A signal x(t) is self-similar if† $x(ct) = c^{H} x(t)$. It can be easily seen that a signal
described by a power law $x(t) = \alpha t^{\beta}$ is self-similar. However, such signals are of limited
interest. A more interesting and useful type of signal is that exhibiting a weaker, that is, 
statistical, version of self-similarity. A random signal is called (statistically) self-similar if 
its statistical properties are scale-invariant, meaning that its statistics do not change under 
magnification or reduction. Self-similar random signals are also known as random fractals. 

Statistical self-similarity means that small fluctuations at small scales become larger 
fluctuations at larger scales. Therefore, as we analyze more and more data, these ever- 
larger fluctuations increase the value of the measured variance, which in the limit becomes 
infinite. This increase of variance with the length of the data has been observed in the analysis 
of various practical time series that exhibit self-similar behavior. Figure 12.22 provides a 
visual illustration of the self-similar behavior of the variable-rate video traffic time series 
(Garrett and Willinger 1994). 

These ideas can be formalized within the context of the theory of stochastic processes 
by using the following definition. 


†The superscript H is an index and not a conjugate transposition operator. For lack of better notation, we will
continue to use the accepted notation.

FIGURE 12.22
Pictorial illustration of self-similarity for the variable-bit-rate video traffic time series. The
bottom series is obtained from the top series by expanding the segment between the two
vertical lines. Although the two series have lengths of 600 and 60 s, they are remarkably
similar visually and statistically. (Courtesy of M. Garrett and M. Vetterli.)


DEFINITION 12.1. A continuous-time stochastic process x(t) is said to be (statistically)
self-similar with (self-similarity) index‡ H (H-ss) if and only if, for any scaling parameter c > 0,
the processes x(ct) and $c^{H} x(t)$ are statistically equivalent, that is, they have the same
finite-dimensional distributions. Symbolically

$$x(ct) \overset{d}{=} c^{H} x(t) \tag{12.6.1}$$

where the symbol $\overset{d}{=}$ denotes equality in distribution and, more specifically, equality of
finite-dimensional joint probability distributions.


It should be emphasized that individual realizations of the process are not necessarily 
deterministically scale-invariant. The above definition of self-similarity has several impli- 
cations, which can be summarized as follows: 


• A change in the time scale is statistically equivalent to a change in the amplitude scale.
Hence, the statistics of x(t) are invariant under the transformation

$$x(t) \to c^{-H} x(ct) \tag{12.6.2}$$

• To obtain statistically equivalent processes, the time axis must be scaled differently from
the amplitude axis. In the language of fractals, we say that the graphs $\{t, x(t)\}$ and
$\{t, c^{-H}x(ct)\}$, $0 < t < \infty$, are statistically self-affine because the scaling factor is
different for the time and amplitude axes. An example of such a self-similar process for
$H = \frac{1}{2}$ is shown in Figure 12.23. This process, whose distribution at each t is Gaussian,
is known as (ordinary) Brownian motion. (A detailed discussion of Brownian motion is
given in the next section.) The time trace shown in the top plot in Figure 12.23 is generated
as a discrete equivalent of x(t), using 16,384 samples over unit time interval. When
it is plotted as a continuous curve, we lose sight of its discrete nature and view it as a
fractal curve that is indistinguishable from a continuous Brownian motion, a true fractal curve
possessing self-similarity at all levels of magnification. Statistical self-affinity of x(t) is
evident as we zoom into it. The zooming area in the top plot is shown as a box, and the
scaled curve is shown in the middle plot. Note that we scaled the middle one-fourth of
the time axis while the amplitude axis was magnified by 2 since $4^{H} = 4^{1/2} = 2$. This
retained the statistical similarity of the middle curve to the original one. Further scaling
of time axis by 4 and the amplitude axis by 2 is shown in the bottom plot of Figure 12.23.
Once again the resulting plot is statistically similar to the original one. This Brownian
motion displayed at different levels of resolution demonstrates the concept of statistical
self-affinity.

‡Also known as the Hurst exponent.


[Figure 12.23: sample function of a Brownian motion shown at three levels of magnification; horizontal axis: time t.]

FIGURE 12.23
Statistical self-affine property of the Brownian motion trace.

• If we set ct = 1 in (12.6.2), we have

$$x(t) \overset{d}{=} t^{H} x(1) \qquad t > 0 \tag{12.6.3}$$

Therefore, self-similar processes cannot be stationary, except for H = 0. This nonstationarity
of the Brownian motion trace x(t) of Figure 12.23 is shown in Figure 12.24,
which illustrates the spreading of signal values about the mean value of zero as time
increases. For display purposes, 10 sample functions of x(t) are shown, all of which
begin at x(0) = 0. This spreading is in a statistical sense, in that some traces return to
zero, and some return more than once. To determine this statistical spreading,
100 sample functions were used, and the sample standard deviation $\sigma_x(t)$ at each t was
computed. This $\pm\sigma_x(t)$, shown as dashed lines in Figure 12.24, clearly indicates the
diffusion (or nonstationarity) property of Brownian motion. Note that since the standard
deviation is proportional to $E\{|x(t + \Delta) - x(t)|\}$, we have

$$\sigma_x(t) \propto t^{1/2} \tag{12.6.4}$$

for the Brownian motion, and the dashed line in Figure 12.24 confirms it.

[Figure 12.24: ten sample functions x(t) of Brownian motion with dashed ±σ_x(t) curves; horizontal axis: time t from 0 to 1.]

FIGURE 12.24
The diffusion property of the Brownian motion trace.

• For strict-sense self-similar processes, all finite-dimensional distributions are equal. However,
for wide-sense self-similar processes, only second-order moments are equal. From
(12.6.2) these moments are given as

$$\mu_x(t) \triangleq E\{x(t)\} = c^{-H} \mu_x(ct) \tag{12.6.5}$$

$$r_x(t_1, t_2) = E\{x(t_1)x(t_2)\} = c^{-2H}\, r_x(ct_1, ct_2) \tag{12.6.6}$$

Clearly, for Gaussian processes the two types of self-similarity are equivalent.


Because of their practical importance, we focus on self-similar stochastic processes 
that have stationary increments. 


DEFINITION 12.2. A real-valued process x(t) has stationary increments if

$$x(t + \tau) - x(t) \overset{d}{=} x(\tau) - x(0) \qquad \text{for all } t \tag{12.6.7}$$


In practical applications, the nature of processes with stationary increments is analyzed
using a quantity known as the semivariogram, defined by

$$v_x(\tau) = \tfrac{1}{2}\, E\{[x(t + \tau) - x(t)]^2\} \tag{12.6.8}$$

which, for stationary processes, reduces to

$$v_x(\tau) = r_x(0) - r_x(\tau) \tag{12.6.9}$$

We next turn our attention to self-similar processes with stationary increments.


DEFINITION 12.3. A continuous-time stochastic process is self-similar with stationary incre- 
ments (H-sssi) if and only if 


• It is self-similar with index H.
• It has stationary increments.


As shown in the following theorem, the requirements for self-similarity and stationary
increments completely specify the second-order moments of the underlying process x(t).


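THEOREM 12.5. The mean value, variance, and autocorrelation of an H-sssi process are given
by, respectively

$$\mu_x(t) = 0 \tag{12.6.10}$$

$$\sigma_x^2(t) = |t|^{2H} \sigma_H^2 \tag{12.6.11}$$

$$r_x(t_1, t_2) = \tfrac{1}{2}\sigma_H^2 \left( |t_1|^{2H} - |t_1 - t_2|^{2H} + |t_2|^{2H} \right) \tag{12.6.12}$$

where $\sigma_H^2 = E\{x^2(1)\}$.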


Proof. From (12.6.2) we have, for t = 0,

$$x(0) \overset{d}{=} c^{-H} x(c \cdot 0) = c^{-H} x(0) \;\Rightarrow\; x(0) = 0 \tag{12.6.13}$$

Also from (12.6.2) and (12.6.3), we conclude that

$$\mu_x(t) = E\{x(t)\} = E\{c^{-H} x(ct)\} = c^{-H} E\{x(ct)\} = t^{H} E\{x(1)\} \tag{12.6.14}$$

Using the stationary increment property (12.6.7), (12.6.13), and (12.6.14), we obtain

$$E\{x(t+1) - x(t)\} = E\{x(1) - x(0)\} = E\{x(1)\} \tag{12.6.15}$$

Using the self-similarity definition, however, we have

$$E\{x(t+1) - x(t)\} = [(t+1)^{H} - t^{H}]\, E\{x(1)\} \tag{12.6.16}$$

Comparing (12.6.15) and (12.6.16), we conclude that $E\{x(1)\} = 0$; hence from (12.6.14)

$$\mu_x(t) = 0$$

which proves (12.6.10). Similarly, since $x(t) \overset{d}{=} t^{H} x(1)$, for t > 0, we have

$$\sigma_x^2(t) = E\{x^2(t)\} = t^{2H} E\{x^2(1)\} = t^{2H} \sigma_H^2$$

which proves (12.6.11). Finally, again using stationarity of the increments and (12.6.11), we
obtain

$$E\{[x(t_1) - x(t_2)]^2\} = E\{[x(t_1 - t_2) - x(0)]^2\} = \sigma_H^2\, |t_1 - t_2|^{2H} \tag{12.6.17}$$

or

$$E\{[x(t_1) - x(t_2)]^2\} = E\{x^2(t_1)\} + E\{x^2(t_2)\} - 2E\{x(t_1)x(t_2)\} = \sigma_H^2 |t_1|^{2H} + \sigma_H^2 |t_2|^{2H} - 2 r_x(t_1, t_2) \tag{12.6.18}$$

where $r_x(t_1, t_2)$ is the autocorrelation function of x(t). Combining the last two equations, we
obtain

$$r_x(t_1, t_2) = \tfrac{1}{2}\sigma_H^2 \left[ |t_1|^{2H} - |t_1 - t_2|^{2H} + |t_2|^{2H} \right] \tag{12.6.19}$$

which completes the proof of the theorem.
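
The properties established in the proof can be checked numerically from the autocorrelation alone. The Python sketch below uses hypothetical function names and assumes unit variance at t = 1 by default. It evaluates (12.6.12) and verifies the variance law (12.6.11), the scaling relation (12.6.6), and the increment variance (12.6.17).

```python
def hsssi_autocorr(t1, t2, H, sigma2=1.0):
    """Autocorrelation (12.6.12) of an H-sssi process."""
    return 0.5 * sigma2 * (abs(t1) ** (2 * H)
                           - abs(t1 - t2) ** (2 * H)
                           + abs(t2) ** (2 * H))

def hsssi_increment_variance(t1, t2, H, sigma2=1.0):
    """E{[x(t1) - x(t2)]^2} assembled from the autocorrelation, as in (12.6.18)."""
    return (hsssi_autocorr(t1, t1, H, sigma2)
            + hsssi_autocorr(t2, t2, H, sigma2)
            - 2.0 * hsssi_autocorr(t1, t2, H, sigma2))

if __name__ == "__main__":
    H, c, t1, t2 = 0.7, 3.0, 0.4, 1.1
    print(hsssi_autocorr(t1, t1, H), t1 ** (2 * H))                           # (12.6.11)
    print(hsssi_autocorr(c * t1, c * t2, H), c ** (2 * H) * hsssi_autocorr(t1, t2, H))
    print(hsssi_increment_variance(t1, t2, H), abs(t1 - t2) ** (2 * H))       # (12.6.17)
```

For H = 1/2 the autocorrelation reduces to min(t1, t2) for positive times, the familiar result for ordinary Brownian motion.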


Self-similar processes with stationary increments are well-defined if H > 0 and x(0) =
0 with probability 1 (Vervaat 1987). For H = 1, $x(t) = t\,x(1)$; that is, the realizations are
lines through the origin, and the process is of no interest. For H = 0, we have x(t) = 0,
which is a trivial process. For H < 0 the process is not mean square continuous, and
for H > 1 the increments are nonstationary. The permissible range of H is determined
by the existence of moments: If x(t) is H-sssi with finite variance, then 0 < H < 1
(Samorodnitsky and Taqqu 1994).

The autocorrelation (12.6.12) shows that H-sssi processes are nonstationary. Despite
this nonstationarity, we can define a time-averaged spectrum. Since small scales correspond
to large frequencies and large scales to small frequencies, the amplitude of the fluctuations
is small at high frequencies and large at low frequencies. In light of the previous discussion,
it should not come as a surprise that the power spectrum of self-similar processes follows
a power law, that is, is proportional to $1/|F|^{\beta}$. Indeed, it has been shown (Flandrin 1989)
that the time-averaged power spectrum of an H-sssi process is given by

$$R_x(F) = \frac{\sigma_H^2}{|F|^{2H+1}} \tag{12.6.20}$$

where F is the frequency in cycles per unit of time. As we can easily see, $R_x(cF) =
c^{-(2H+1)} R_x(F)$, which shows that the process is wide-sense self-similar.


12.6.2 Fractional Brownian Motion 


If we restrict the probability distribution of an H-sssi process to being Gaussian, we obtain a
unique process known as the fractional Brownian motion, abbreviated as FBM (Mandelbrot
and Van Ness 1968). These FBMs have Hurst exponents in the range 0 < H < 1. The
(ordinary) Brownian motion of Figure 12.23 is a special case of FBM when $H = \frac{1}{2}$.

DEFINITION 12.4. A Gaussian H-sssi process, 0 < H < 1, is called fractional Brownian motion
(FBM) and is denoted by $B_H(t)$.


There are several equivalent definitions of FBM process, which are summarized by the
following theorem (Samorodnitsky and Taqqu 1994; Beran 1994).


THEOREM 12.6. If 0 < H < 1 and $\sigma_H^2 = E\{x^2(1)\}$, the following statements are equivalent:

1. $B_H(t)$ is Gaussian and H-sssi.
2. $B_H(t)$ is fractional Brownian motion with self-similarity index H.
3. $B_H(t)$ is Gaussian and has mean zero for H < 1 and autocorrelation function

$$r_{B_H}(t_1, t_2) = \tfrac{1}{2}\sigma_H^2 \left( |t_1|^{2H} - |t_1 - t_2|^{2H} + |t_2|^{2H} \right) \tag{12.6.21}$$
EXAMPLE 12.6.1. In Figure 12.25 we show time traces of FBMs for H = 0.2, 0.5, and 0.8.
Clearly, there is a very noticeable qualitative difference between the traces.
For a low value of H = 0.2, the trace shows more fractured or crinkled behavior. This behavior
occurs for 0 < H < 0.5, and the corresponding traces have tendencies to turn back upon
themselves (negative correlation). The corresponding property is known as antipersistence. A
stock market fluctuation is a good example of this process. As H increases, the amount of crinkle
reduces. For H = 0.5 we have the (ordinary) Brownian motion for which the correlation is zero,
and the trace shows no preferred tendency to turn back or persist in the same direction (neutral
in persistence). For a high value of H = 0.8, the trace is smoother, and in fact for 0.5 < H < 1,
the FBM traces show persistence in the direction in which they are moving (positive correlation).
This property is known as persistence. Typical coastlines (boundaries between land and water)
are good examples of such traces.

[Figure 12.25: three traces of fractional Brownian motion, $B_{0.2}(t)$, $B_{0.5}(t)$, and $B_{0.8}(t)$; horizontal axis: time t from 0 to 1.]

FIGURE 12.25
Time traces of fractional Brownian motion for H = 0.2, H = 0.5, and H = 0.8.
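
The sign of the correlation mentioned in the example can be made concrete by computing the correlation between increments of FBM taken over unit-length intervals. The Python sketch below uses a hypothetical function name. The formula is not stated in the text; it is obtained here by expanding the increment product with the autocorrelation (12.6.21), which gives a normalized lag-m correlation of $\frac{1}{2}(|m+1|^{2H} - 2|m|^{2H} + |m-1|^{2H})$.

```python
def fbm_increment_correlation(lag, H):
    """Correlation between unit-step FBM increments separated by `lag` steps."""
    m = abs(lag)
    return 0.5 * ((m + 1) ** (2 * H) - 2.0 * m ** (2 * H) + abs(m - 1) ** (2 * H))

if __name__ == "__main__":
    for H in (0.2, 0.5, 0.8):
        # Adjacent increments: negative for H < 0.5, zero at 0.5, positive above.
        print(H, fbm_increment_correlation(1, H))
```

For adjacent increments the expression reduces to $2^{2H-1} - 1$, which is negative for H < 0.5 (antipersistence), zero for H = 0.5, and positive for H > 0.5 (persistence).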


From the above example, we note that the fractal behavior of traces diminishes as
H increases from 0 to 1. Hence there must be an inverse relationship between H and the
fractal dimension D (also known as the Hausdorff dimension). The concept of dimension is
closely related to the property of self-similarity or scaling. For the purpose of discussion, let
us consider our natural Euclidean dimensions. In one dimension, a line segment possesses
a scaling property. If it is subdivided into N identical line segments, then each segment is
scaled down by the ratio r = 1/N from the whole, or Nr = 1. A square is a two-dimensional
object that possesses the scaling property. If it is subdivided into N equal squares, then each
square side is scaled down by a factor of $r = 1/\sqrt{N}$, or $Nr^2 = 1$. Carrying this analysis
to the cube in three dimensions, we observe that if a cube is subdivided into N identical
cubes, then each subcube edge is scaled down by the factor $r = 1/\sqrt[3]{N}$, or $Nr^3 = 1$. Now
we can generalize this analysis to an arbitrary noninteger dimension D. If a D-dimensional
object is subdivided into N identical copies of itself, then the side of each copy is scaled
down by the ratio $r = 1/\sqrt[D]{N}$, or $Nr^D = 1$. Thus we obtain

$$D = \frac{\log N}{\log(1/r)} \tag{12.6.22}$$

The above approach can also be used to determine the fractal dimension D of the FBM
traces and to relate it to the Hurst exponent H. One interesting technique for determining the
fractal dimension is known as box counting. The basic idea is to compute the total number
N of enclosing boxes (or rectangles) needed to cover all identical subtraces that have been
scaled down by the ratio r from the whole trace and then use formula (12.6.22) to estimate
the fractal dimension. Refer to the top plot of Figure 12.23. The enclosing box shows that
if the whole trace is divided into 4 identical subtraces, then the box height is scaled down
by $(\frac{1}{4})^{1/2} = \frac{1}{2}$. Thus the area of each rectangular box is

$$\left(\frac{1}{4}\right)\left(\frac{1}{2}\right) = \frac{1}{4^{3/2}} = \frac{1}{4^{1+1/2}} \tag{12.6.23}$$

However, we have to relate the scaling of smaller (identical) square boxes to the original
box since the original is a square box of unit side length (this implicitly assumes that
the amplitude axis in Figure 12.23 is unity, which is not unreasonable since we are using
fractions). The smaller square boxes of side length $r = \frac{1}{4}$ have area equal to $1/4^2$. Thus the
number of square boxes required to cover each subinterval is (note the box counting)

$$\frac{1/4^{3/2}}{1/4^{2}} = \frac{1}{4^{1/2-1}} \tag{12.6.24}$$

Since there are 4 subintervals in Figure 12.23, the total number of square boxes required to
cover the whole trace is

$$N = 4\left(\frac{1}{4^{1/2-1}}\right) = \frac{1}{4^{1/2-2}} \tag{12.6.25}$$

Hence substituting (12.6.25) into (12.6.22), and using $r = \frac{1}{4}$, we obtain

$$D = \frac{\log\left(1/4^{1/2-2}\right)}{\log(1/r)} = \frac{\log 4^{2-1/2}}{\log 4} = 2 - \frac{1}{2} \tag{12.6.26}$$



which is the fractal dimension of (ordinary) Brownian motion. Generalizing to 0 < H < 1,
we can show that (see Problem 12.12)

$$D = 2 - H \tag{12.6.27}$$

Thus the sample paths of fractional Brownian motion are fractal curves with Hausdorff
dimension D = 2 - H (Falconer 1990). Referring to Figure 12.25, we see the fractal
dimensions of the fractional Brownian motions are D = 1.8 for H = 0.2 (antipersistent),
D = 1.5 for H = 0.5 (Brownian motion), and D = 1.2 for H = 0.8 (persistent). Thus the
more wiggly the trace, the higher the dimension.
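
The box-counting argument can be checked numerically. The following Python sketch repeats the steps leading to (12.6.26) for an arbitrary number of subtraces and an arbitrary Hurst exponent; the function name `box_count_dimension` and its arguments are illustrative choices, not part of the text.

```python
import math

def box_count_dimension(n_sub: int, hurst: float) -> float:
    """Fractal dimension from the box-counting argument with n_sub subtraces."""
    r = 1.0 / n_sub                        # horizontal scaling ratio
    box_area = r * r ** hurst              # enclosing box: width r, height r**H
    square_area = r ** 2                   # small square box of side r
    boxes_per_sub = box_area / square_area
    total_boxes = n_sub * boxes_per_sub    # N, as in (12.6.25)
    return math.log(total_boxes) / math.log(1.0 / r)   # formula (12.6.22)

for h in (0.2, 0.5, 0.8):
    print(h, box_count_dimension(4, h))
```

Under these assumptions the result does not depend on the number of subtraces and agrees with D = 2 - H.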


Continuous-time fractional pole systems. In Section 12.5 we used a discrete-time 
fractional pole to obtain a system with long memory. When this system is driven by a 
WGN process, the result is a long-memory process called discrete-time FGN. This leads 
to the following question: Can we use a continuous-time fractional pole to obtain a long- 
memory system that could be used to generate a long-memory process in general and frac- 
tional Brownian motion in particular? The answer is yes, so now we provide an intuitive 
engineering explanation. 

For any d > 0, we have the following Laplace transform pair (Abramowitz and Stegun
1970)

$$h_d(t) = \frac{1}{\Gamma(d)}\, t^{d-1} u(t) \;\longleftrightarrow\; H_d(s) = \frac{1}{s^{d}} \tag{12.6.28}$$

where $\Gamma(\cdot)$ is the gamma function. Note that for d = 1, $h_d(t)$ corresponds to an ideal
integrator. However, for fractional d, the function $h_d(t)$ has a hyperbolic decay. The result
is a system with long memory called the fractional integrator. These topics are the subject
of a discipline known as fractional calculus (Oldham and Spanier 1974).
The output of the fractional integrator is provided by the convolution integral

$$y(t) = \frac{1}{\Gamma(d)} \int_{-\infty}^{\infty} (t-\tau)^{d-1} u(t-\tau)\, w(\tau)\, d\tau \tag{12.6.29}$$

which satisfies the following scaling property: when the input is time-scaled to $w(c\tau)$, $c > 0$,
the output becomes

$$\frac{1}{\Gamma(d)} \int_{-\infty}^{t} (t-\tau)^{d-1} w(c\tau)\, d\tau = \frac{c^{-d}}{\Gamma(d)} \int_{-\infty}^{ct} (ct-\lambda)^{d-1} w(\lambda)\, d\lambda = c^{-d}\, y(ct) \tag{12.6.30}$$

where $\lambda = c\tau$. Linear systems that satisfy (12.6.30) are said to be linear, scale-invariant
systems (Wornell 1996). We emphasize that while linear, shift-invariant systems with rational
system functions have memory that decays exponentially, linear, scale-invariant systems
exhibit self-similarity and long (hyperbolically decaying) memory.
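
The scaling behavior comes from the power-law form of the impulse response in (12.6.28). As a small numerical illustration (a sketch only; the helper name `h_frac` is hypothetical), the kernel satisfies $h_d(ct) = c^{d-1} h_d(t)$, and the extra factor $c^{-1}$ from the change of variable in the convolution integral gives the overall $c^{-d}$.

```python
import math

def h_frac(t: float, d: float) -> float:
    """Impulse response t**(d-1)/Gamma(d) for t > 0, and 0 otherwise."""
    if t <= 0.0:
        return 0.0
    return t ** (d - 1.0) / math.gamma(d)

d, c, t = 0.3, 5.0, 2.0
print(h_frac(c * t, d), c ** (d - 1.0) * h_frac(t, d))   # the two values agree
print(h_frac(3.0, 1.0))   # d = 1: ideal integrator, constant response
```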

Intuition suggests that the output of scale-invariant systems, driven by white noise, 
should exhibit statistical self-similarity. Indeed, it can be shown that linear, scale-invariant 
systems can be used to generate fractional Brownian motion processes (Samorodnitsky and 
Taqqu 1994). More specifically, the fractional Brownian motion process can be generated 
by passing white noise through a linear, scale-invariant system 


$$B_H(t) = \int_{-\infty}^{\infty} h_t(\tau)\, w(\tau)\, d\tau \tag{12.6.31}$$

with the following causal impulse response

$$h_t(\tau) = \frac{1}{C(H)} \left\{ [(t-\tau)_+]^{H-1/2} - [(-\tau)_+]^{H-1/2} \right\} \tag{12.6.32}$$

where

$$C(H) = \left\{ \int_{0}^{\infty} \left[ (1+\tau)^{H-1/2} - \tau^{H-1/2} \right]^{2} d\tau + \frac{1}{2H} \right\}^{1/2} \tag{12.6.33}$$

and

$$(u)_+ = \begin{cases} u & \text{if } u \ge 0 \\ 0 & \text{if } u < 0 \end{cases} \tag{12.6.34}$$

We note that the change from the impulse response (12.6.28) to (12.6.32) was introduced by
Mandelbrot (1982) to ensure that $B_H(t)$ has the required properties (Wornell 1993; Kasdin
1995). An equivalent harmonizable representation of fractional Brownian motion in the
frequency domain is also derived in Samorodnitsky and Taqqu (1994).


12.6.3 Fractional Gaussian Noise 


The discrete fractional Gaussian noise is a stationary sequence obtained by periodically
sampling the fractional Brownian motion process $B_H(t)$ and then computing the first
difference. The resulting random sequence is $x(nT) \triangleq B_H(nT) - B_H(nT - T)$, where T is
the sampling interval. Since the fractional Brownian motion process is statistically
scale-invariant, we set T = 1. Therefore, the discrete fractional Gaussian noise process is defined
by

$$x(n) = B_H(n) - B_H(n-1) \tag{12.6.35}$$

and it is simply referred to as FGN.

We next determine the second-order moments, that is, the autocorrelation and PSD of 
the FGN process. 


THEOREM 12.7. The autocorrelation sequence of the discrete fractional Gaussian noise is

$$r_x(l) = \tfrac{1}{2}\sigma_H^2 \left( |l-1|^{2H} - 2|l|^{2H} + |l+1|^{2H} \right) \tag{12.6.36}$$

Since the correlation depends only on the distance l between the samples, the process is
wide-sense stationary.


Proof. Using (12.6.21) and (12.6.35), we can easily show that

$$E\{x(n)x(n-l)\} = E\{[B_H(n) - B_H(n-1)][B_H(n-l) - B_H(n-l-1)]\} = \tfrac{1}{2}\sigma_H^2 \left[ |l-1|^{2H} - 2|l|^{2H} + |l+1|^{2H} \right]$$

which leads to (12.6.36).


Figure 12.26 shows the autocorrelation sequence for various values of the self-similarity
index H. Note that for $H = \tfrac{1}{2}$ we have $r_x(l) = \sigma_H^2\,\delta(l)$, which shows that the FGN process is
white noise.
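
Formula (12.6.36) is easy to evaluate directly. The short Python sketch below (the function name `fgn_autocorr` is an illustrative choice) reproduces the white-noise case and shows the sign of the lag-1 correlation on either side of H = 0.5.

```python
def fgn_autocorr(lag: int, hurst: float, sigma2: float = 1.0) -> float:
    """Autocorrelation of discrete FGN at integer lag, per (12.6.36)."""
    l = abs(lag)
    p = 2.0 * hurst
    return 0.5 * sigma2 * (abs(l - 1) ** p - 2.0 * l ** p + (l + 1) ** p)

print([fgn_autocorr(l, 0.5) for l in range(4)])   # unit sample: white noise
print(fgn_autocorr(1, 0.8), fgn_autocorr(1, 0.2)) # positive, then negative
```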


THEOREM 12.8. The power spectrum of the FGN process x(n) is given by

$$R_x(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_x(l)\, e^{-j\omega l} = \sigma_H^2 C_H\, |1 - e^{-j\omega}|^{2} \sum_{k=-\infty}^{\infty} \frac{1}{|\omega + 2\pi k|^{2H+1}} \tag{12.6.37}$$

where

$$C_H = 2H\,\Gamma(2H) \sin(\pi H) \tag{12.6.38}$$

is a constant dependent on the self-similarity index.


Proof. A rigorous proof can be found in Samorodnitsky and Taqqu (1994). Here we provide a
more heuristic proof. The sequence x(n) is obtained by sampling the FBM process $B_H(t)$ every
T = 1 time unit, that is, evaluating $s(n) = B_H(nT)$, and then computing the first difference
x(n) = s(n) - s(n - 1). From the sampling theorem (see Chapter 2) we have

$$R_s(e^{j\omega}) = \frac{1}{T} \sum_{k=-\infty}^{\infty} R_B\!\left( \frac{\omega}{T} + \frac{2\pi k}{T} \right)$$

where $\omega = \Omega T$. The frequency response of the first-difference filter is $H(e^{j\omega}) = 1 - e^{-j\omega}$, or

$$|H(e^{j\omega})|^{2} = 2(1 - \cos\omega) = 4 \sin^{2}\!\left( \frac{\omega}{2} \right)$$

Since, from (12.6.20), $R_B(\Omega) = \sigma_H^2 C_H / |\Omega|^{2H+1}$, the power spectrum of x(n) is

$$R_x(e^{j\omega}) = |H(e^{j\omega})|^{2} R_s(e^{j\omega}) = 2\sigma_H^2 T^{2H} C_H (1 - \cos\omega) \sum_{k=-\infty}^{\infty} \frac{1}{|\omega + 2\pi k|^{2H+1}}$$

which results in (12.6.37) for T = 1.


FIGURE 12.26
Autocorrelation sequence of FGN for H = 0.1 to H = 0.9 at 0.1 increments.


Figure 12.27 shows the PSD of FGN for various values of the self-similarity index H.
Note that for $H = \tfrac{1}{2}$ we have a flat PSD, which shows that the FGN process is white
noise.
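
As a rough numerical check of (12.6.37), the infinite sum over k can be truncated. The sketch below assumes the constant $C_H$ exactly as written in (12.6.38) and a symmetric truncation to `n_terms` terms on each side; both the function name `fgn_psd` and the truncation length are illustrative choices.

```python
import math

def fgn_psd(omega: float, hurst: float, sigma2: float = 1.0,
            n_terms: int = 2000) -> float:
    """Truncated evaluation of the FGN power spectrum (12.6.37)."""
    c_h = 2.0 * hurst * math.gamma(2.0 * hurst) * math.sin(math.pi * hurst)
    expo = 2.0 * hurst + 1.0
    total = sum(abs(omega + 2.0 * math.pi * k) ** -expo
                for k in range(-n_terms, n_terms + 1))
    # |1 - exp(-j*omega)|**2 equals 2*(1 - cos(omega))
    return 2.0 * sigma2 * c_h * (1.0 - math.cos(omega)) * total

print(fgn_psd(1.0, 0.5))                       # close to 1: flat spectrum
print(fgn_psd(0.1, 0.8) > fgn_psd(1.0, 0.8))   # low-pass shape for H > 1/2
print(fgn_psd(0.1, 0.2) < fgn_psd(1.0, 0.2))   # high-pass shape for H < 1/2
```

The truncation error grows as H decreases, since the terms of the sum then decay more slowly.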


Self-similarity. The discrete FGN process is asymptotically (i.e., at large scales)
self-similar. Indeed, the autocorrelation

$$r(l) \sim \sigma_H^2 H(2H-1)\,|l|^{2H-2} \qquad \text{as } |l| \to \infty,\ H \ne \tfrac{1}{2}$$

decays hyperbolically for large lags, and the PSD

$$R(e^{j\omega}) \sim \sigma_H^2 C_H \frac{1}{|\omega|^{2H-1}} \qquad \text{as } |\omega| \to 0,\ H \ne \tfrac{1}{2}$$

follows a power law as the frequency becomes very small, that is, as the period becomes
very large.


Process memory. The FGN process has long memory for $\tfrac{1}{2} < H < 1$ because the
summation $\sum_{l=-\infty}^{\infty} r(l) = \infty$, or equivalently $R(e^{j\omega}) \to \infty$ as $|\omega| \to 0$. In this case
the autocorrelation decays very slowly, the PSD resembles the response of a low-pass filter,
and the resulting realizations look smooth. In contrast, the process exhibits short memory
for $0 < H < \tfrac{1}{2}$, because $\sum_{l=-\infty}^{\infty} |r(l)| < \infty$ and $\sum_{l=-\infty}^{\infty} r(l) = 0$, or equivalently
$R(e^{j\omega}) \to 0$ as $|\omega| \to 0$. In addition, for $0 < H < \tfrac{1}{2}$ the correlation is negative, that is,
$r(l) < 0$ for $l \ne 0$; and the process exhibits negative dependence, or antipersistence. In this
case the autocorrelation decays very rapidly, the PSD resembles the response of a high-pass
filter, and the resulting realizations look rough.


FIGURE 12.27
PSD function of FGN for H = 0.1 to H = 0.9 at 0.1 increments.


Comparison between FGN and FPN. The discrete-time FGN and FPN processes have 
been independently introduced and have been developed using different approaches. How- 
ever, close inspection of their second-order statistics reveals some interesting similarities 
and differences, which are summarized in Table 12.3. The most interesting feature is that 
both processes become asymptotically self-similar at large scales. 


12.6.4 Simulation of Fractional Brownian Motions and Fractional Gaussian Noises 


Although statistically self-similar processes are relatively easy to describe notationally [see
(12.6.1)], they are not easy to generate since there is no explicit (or compact) mathematical
formula to do so. The FBMs and FGNs are special cases of these processes whose
stationary increments have an underlying Gaussian distribution. Although an explicit
formula exists for FBM [see (12.6.31)], the additional complication is that we cannot
generate a continuous trace (this would require infinitely long memory). We can only hope to
generate an approximate, sampled version of the process on a computer. Thus, as explained
before, these simulations are not self-similar at all scales. Nevertheless, we provided plots
of these processes in Figure 12.25 for various values of the self-similarity index H. This
can be done via techniques that either use properties of the processes or employ indirect
approaches. In this section, we provide a brief summary of some of these techniques. For
a more detailed discussion see Samorodnitsky and Taqqu (1994) and Barnsley et al. (1988).
We begin with the simulation of ordinary Brownian motion, which is easy to generate.


TABLE 12.3
Similarities and differences between discrete fractional Gaussian noise (FGN) and discrete
fractional pole noise (FPN).

Definition
  FGN: $x(n) \triangleq B_H(n) - B_H(n-1)$
  FPN: $x(n) = \sum_{k=0}^{\infty} \frac{\Gamma(k+d)}{\Gamma(k+1)\Gamma(d)}\, w(n-k)$, with $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$ and $-\tfrac{1}{2} < d < \tfrac{1}{2}$

Autocorrelation
  FGN: $r(l) = \tfrac{1}{2}\sigma_H^2 \left( |l-1|^{2H} - 2|l|^{2H} + |l+1|^{2H} \right)$
  FPN: $r(l) = \sigma_w^2 \frac{(-1)^{l}\,\Gamma(1-2d)}{\Gamma(1+l-d)\,\Gamma(1-l-d)}$

Power spectrum
  FGN: $R(e^{j\omega}) = \sum_{k=-\infty}^{\infty} \frac{2 C_H \sigma_H^2 (1-\cos\omega)}{|\omega + 2\pi k|^{2H+1}}$
  FPN: $R(e^{j\omega}) = \frac{\sigma_w^2}{[2\sin(\omega/2)]^{2d}}$

Self-similarity (as $|l| \to \infty$)
  FGN: $r(l) \sim \sigma_H^2 H(2H-1)\,|l|^{2H-2}$
  FPN: $r(l) \sim C_d\,|l|^{2d-1}$, with $d = H - \tfrac{1}{2}$

Short memory ($0 < H < \tfrac{1}{2}$ for FGN, $-\tfrac{1}{2} < d < 0$ for FPN)
  $\sum_{l=-\infty}^{\infty} r(l) = R(e^{j0}) = 0$, $\sum_{l=-\infty}^{\infty} |r(l)| < \infty$, and $r(l) < 0$ for $l \ne 0$

Partial correlation
  FPN: $k_m = \frac{d}{m-d}$, $m = 1, 2, 3, \ldots$


Cumulative-sum method. This technique is a direct method that is suitable for generating
FBM for H = 0.5. We note that the increments of this process not only are stationary
but also are uncorrelated with one another. These increments then form the WGN process,
which can be simulated on a computer. Thus by integrating WGN we can obtain the ordinary
Brownian motion. For discrete FBM, this requires taking the cumulative sum of the
generated WGN sequence. Therefore, the steps in generating the ordinary Brownian motion
are as follows:


1. Subdivide the time axis into a sufficiently fine grid. Let the number of grid points be N.

2. Generate N independent Gaussian random numbers with mean 0 and variance $\sigma^2$. In
MATLAB this can be done using the randn(N,1) function, scaled by $\sigma$.

3. Obtain a cumulative sum of the random numbers obtained in step 2 above. In MATLAB
use the cumsum function. The resulting sequence is a discrete approximation of ordinary
Brownian motion.


A MATLAB function to implement the above steps is explored in Problem 12.13. Since
it is difficult to generate properly correlated random numbers, the cumulative-sum method
is not suitable for H other than 0.5.
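
The three steps of the cumulative-sum method translate almost line for line into code. The following is a minimal Python sketch rather than the MATLAB function requested in Problem 12.13; the `sigma` and `seed` arguments are added here for convenience and are assumptions.

```python
import random
from itertools import accumulate

def obm_cumsum(n: int, sigma: float = 1.0, seed=None) -> list:
    """Discrete approximation of ordinary Brownian motion (H = 0.5)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, sigma) for _ in range(n)]   # step 2: WGN samples
    return list(accumulate(w))                      # step 3: cumulative sum

x = obm_cumsum(16384, seed=1)
print(len(x), x[:3])
```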


Spectral synthesis method. This method can be used to generate FBM with an index
0 < H < 1. The basic principle behind the spectral synthesis approach is that if we can
construct its spectral density function $R_B(F)$, then we can obtain the corresponding FBM
through inverse transformation. From (12.6.20), we have

$$R_B(F) \propto \frac{1}{|F|^{\beta}} \qquad \beta = 2H + 1 \tag{12.6.39}$$

Also, similar to the spectral density function relation (5.3.2) for discrete-time stochastic
processes, we have

$$R_B(F) = \lim_{T \to \infty} \frac{1}{T} E\{ |X_T(F)|^{2} \} \tag{12.6.40}$$

where $X_T(F)$ denotes the Fourier transform of the trace observed over an interval of length T.
Thus from (12.6.39) and (12.6.40) it is possible to obtain a frequency-domain method for
approximating samples of an FBM with 0 < H < 1. Let {x(n)} be the sample functions of
an FBM with Hurst parameter H. Then its DTFT magnitude $|X(e^{j\omega})|$ has the form

$$|X(e^{j\omega})| \propto \frac{1}{|\omega|^{\beta/2}} \tag{12.6.41}$$

Since this is a continuous function, we use the DFT approach to obtain samples in the time
domain. If we sample $X(e^{j\omega})$ at N equispaced frequencies $\omega_k = 2\pi k/N$, $0 \le k \le N-1$,
then the DFT magnitude has the form

$$|X(k)| \propto \begin{cases} \left( \dfrac{2\pi k}{N} \right)^{-\beta/2} & 1 \le k \le \dfrac{N}{2} \\[2mm] |X(N-k)| & \dfrac{N}{2} < k \le N-1 \end{cases} \tag{12.6.42}$$

The phase of X(k) can be chosen to be random, uniformly distributed over $[-\pi, \pi]$ subject
to the constraint of odd symmetry. Finally, taking the IDFT of X(k) results in a sequence
that approximates samples of the FBM with $H = (\beta - 1)/2$. The steps of this spectral
synthesis method can be summarized as follows:


1. Given H, determine $\beta = 2H + 1$.

2. Choose sufficiently large N, and use a suitable proportionality constant to generate
|X(k)| according to (12.6.42).

3. Randomize phase $\theta(k)$; that is, generate phase values according to

$$\theta(k) = \begin{cases} \text{uniform random number over } [-\pi, \pi] & 0 \le k \le \dfrac{N}{2} \\[2mm] -\theta(N-k) & \dfrac{N}{2} < k \le N-1 \end{cases} \tag{12.6.43}$$

4. Assemble $X(k) = |X(k)| \exp[j\theta(k)]$, $0 \le k \le N-1$, and determine the IDFT to obtain
x(n).


One major problem with this technique is that the resulting sequence is periodic with
period N due to the DFT operation (or sampling in the frequency domain). Therefore, to
avoid these boundary problems, only the middle third of the sequence is used as a representative
FBM trace. The FBM traces shown in Figure 12.25 were generated using the above steps.
A MATLAB function to implement the above steps is explored in Problem 12.14.
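
The spectral synthesis steps can be sketched in Python as follows. This is an illustrative implementation under several assumptions that the text does not specify: the DC term X(0) is set to zero, the phase at k = N/2 is set to zero so that the IDFT is real, the proportionality constant is taken as 1 with magnitudes $k^{-\beta/2}$, and a direct $O(N^2)$ inverse DFT is used so that no FFT library is needed.

```python
import cmath
import math
import random

def fbm_spectral(hurst: float, n: int, seed=None) -> list:
    """Approximate FBM samples by spectral synthesis; returns the middle third."""
    rng = random.Random(seed)
    beta = 2.0 * hurst + 1.0                  # step 1
    X = [0j] * n                              # X[0] = 0 (assumed: zero-mean trace)
    for k in range(1, n // 2 + 1):
        mag = k ** (-beta / 2.0)              # step 2: power-law magnitude
        # step 3: random phase; zero at k = N/2 keeps the sequence real
        theta = 0.0 if 2 * k == n else rng.uniform(-math.pi, math.pi)
        X[k] = cmath.rect(mag, theta)
        X[n - k] = X[k].conjugate()           # even magnitude, odd phase
    x = []
    for m in range(n):                        # step 4: direct inverse DFT
        s = sum(X[k] * cmath.exp(2j * math.pi * k * m / n) for k in range(n))
        x.append(s.real / n)
    return x[n // 3: 2 * n // 3]              # discard the periodic boundaries

trace = fbm_spectral(0.7, 192, seed=0)
print(len(trace))
```

For long traces the double loop should be replaced by an FFT routine; the direct sum is used here only to keep the sketch self-contained.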

Note that the corresponding FGN sequence is obtained by taking a first-order difference
of the generated FBM sequence, that is,

$$w(n) = x(n) - x(n-1) \qquad 1 \le n \le N-1 \tag{12.6.44}$$


Random midpoint replacement method. This is another direct method to produce
FBM and is based on the scaling property of the increments [from (12.6.11)] that

$$\mathrm{var}[\Delta B_H(t)] = |\Delta t|^{2H} \sigma_H^2 \tag{12.6.45}$$

The approach is to begin generating random sequence values at the endpoints of the interval
and then successively subdivide the interval and generate a random value at the midpoint
of the smaller interval according to (12.6.45). Therefore, this method can be implemented
recursively. To generate an FBM over the interval [0, 1] with parameter H, the following
steps can be used:
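1. Choose $B_H(0) = 0$ and select $B_H(1)$ equal to a Gaussian random number with mean 0
and variance $\sigma_H^2$, since

$$\sigma_H^2 = E\{B_H^2(1)\}$$

Clearly, $\mathrm{var}[B_H(1) - B_H(0)] = 1^{2H} \sigma_H^2 = \sigma_H^2$.

2. For the first stage, set $B_H(\tfrac{1}{2})$ to be the average of $B_H(0)$ and $B_H(1)$ plus some
independent Gaussian number offset $\delta_1$ with mean zero and variance $\mathrm{var}(\delta_1)$, that is,

$$B_H(\tfrac{1}{2}) = \tfrac{1}{2}[B_H(0) + B_H(1)] + \delta_1 \tag{12.6.46}$$

Thus $B_H(\tfrac{1}{2}) - B_H(0)$ and $B_H(1) - B_H(\tfrac{1}{2})$ have mean 0 and variance

$$\mathrm{var}[B_H(\tfrac{1}{2}) - B_H(0)] = \tfrac{1}{4}\,\mathrm{var}[B_H(1) - B_H(0)] + \mathrm{var}(\delta_1)$$

$$\left( \tfrac{1}{2} \right)^{2H} \sigma_H^2 = \tfrac{1}{4}\sigma_H^2 + \mathrm{var}(\delta_1) \tag{12.6.47}$$

or

$$\mathrm{var}(\delta_1) = \left( \tfrac{1}{2} \right)^{2H} \sigma_H^2 - \tfrac{1}{4}\sigma_H^2 \tag{12.6.48}$$

3. At the second stage, we generate $B_H(\tfrac{1}{4})$ and $B_H(\tfrac{3}{4})$, using the above method specialized
to $\Delta t = \tfrac{1}{4}$, that is,

$$B_H(\tfrac{1}{4}) = \tfrac{1}{2}[B_H(0) + B_H(\tfrac{1}{2})] + \delta_{2,1}$$

$$B_H(\tfrac{3}{4}) = \tfrac{1}{2}[B_H(\tfrac{1}{2}) + B_H(1)] + \delta_{2,2}$$

with

$$\mathrm{var}(\delta_{2,1}) = \mathrm{var}(\delta_{2,2}) = \left( \tfrac{1}{4} \right)^{2H} \sigma_H^2 - \tfrac{1}{4}\left( \tfrac{1}{2} \right)^{2H} \sigma_H^2 \tag{12.6.49}$$

4. Continuing in this fashion, at stage r we generate $2^{r-1}$ midpoints as the average of their
respective endpoints plus a Gaussian random number offset $\delta_{r,k}$, $k = 1, 2, \ldots, 2^{r-1}$,
with variance

$$\mathrm{var}(\delta_{r,k}) = \left[ \left( \frac{1}{2^{r}} \right)^{2H} - \frac{1}{4}\left( \frac{1}{2^{r-1}} \right)^{2H} \right] \sigma_H^2 = \left( \frac{1}{2^{r}} \right)^{2H} \left( 1 - 2^{2H-2} \right) \sigma_H^2 \tag{12.6.50}$$

$$\phantom{\mathrm{var}(\delta_{r,k})} = \frac{1}{2^{2H}}\, \mathrm{var}(\delta_{r-1,k}) \tag{12.6.51}$$

Thus, as expected for an FBM, at time scale $1/2^{r}$ we add randomness with mean 0 and
variance proportional to $(1/2^{r})^{2H}$ according to (12.6.50). Also from (12.6.51) we can
recursively generate the variance at each stage.

5. Stop the procedure when a sufficient number of trace points are generated.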


This method also suffers from a few shortcomings. The most troublesome problem is
that once a given midpoint is generated, its value remains unchanged in all later stages. Thus
points generated at different stages have different statistical properties in their neighborhood.
This produces a visible trace that does not seem to go away even if more stages are added,
and the artifact is more pronounced as H → 1. A MATLAB function implementing the above
steps in a recursive fashion is explored in Problem 12.15. Once again, the corresponding
FGN sequence is obtained by taking a first-order difference of the generated FBM sequence.
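
A compact way to see the stage-by-stage structure is an iterative (rather than recursive) sketch in Python. The function names and the `stages`, `sigma2`, and `seed` arguments are illustrative assumptions; the offset variance follows the closed form in (12.6.50).

```python
import random

def midpoint_variance(r: int, hurst: float, sigma2: float = 1.0) -> float:
    """Variance of the random offset added at stage r, as in (12.6.50)."""
    return (0.5 ** r) ** (2.0 * hurst) * (1.0 - 2.0 ** (2.0 * hurst - 2.0)) * sigma2

def fbm_midpoint(hurst: float, stages: int, sigma2: float = 1.0, seed=None) -> list:
    """FBM on [0, 1] at 2**stages + 1 points by random midpoint replacement."""
    rng = random.Random(seed)
    x = [0.0, rng.gauss(0.0, sigma2 ** 0.5)]        # step 1: B_H(0) and B_H(1)
    for r in range(1, stages + 1):
        std = midpoint_variance(r, hurst, sigma2) ** 0.5
        refined = []
        for left, right in zip(x[:-1], x[1:]):
            refined.append(left)                     # earlier points stay fixed
            refined.append(0.5 * (left + right) + rng.gauss(0.0, std))
        refined.append(x[-1])
        x = refined
    return x

trace = fbm_midpoint(0.7, 10, seed=3)
print(len(trace))   # 1025 points
```

Because earlier points are copied unchanged into each refined list, the sketch also exhibits the shortcoming described above.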

The generation of one- and higher-dimensional FBM is a very popular subject in engi- 
neering, sciences, and computer graphics. More information and additional references can 
be found in Mandelbrot (1982), Maeder (1995), Peitgen et al. (1988), and Samorodnitsky 
and Taqqu (1994) and in the vast literature on fractals. 


12.6.5 Estimation of Long Memory 


The estimation of the self-similarity index H or the long-memory parameter $d = H - \tfrac{1}{2}$ is
a very difficult task. A summary of the most widely used methods, including an empirical 
evaluation, is provided in Taqqu et al. (1995). Additional information can be found in Beran 
(1994), Beran et al. (1995), and Brockwell and Davis (1991). We next present two simple 
methods that exploit the definition of self-similarity in the time and frequency domains 
(Pentland 1984; Beran 1994). 

For any self-similar process x(n) and any integer $\Delta > 0$, the increments $\Delta x(n) \triangleq
x(n + \Delta) - x(n)$ have zero mean and satisfy the relation

$$E\{[\Delta x(n)]^{2}\} = C \Delta^{2H} \tag{12.6.52}$$

where C is a constant. Taking the natural logarithm of both sides, we have

$$\ln E\{[\Delta x(n)]^{2}\} = \ln C + 2H \ln \Delta \tag{12.6.53}$$

which can be used to estimate H using linear regression on a log-log plot. The expectation
on the left side of (12.6.53) can be estimated by using the mean value of $[\Delta x(n)]^{2}$.

In practice, to avoid the influence of outliers, we use the quantity $E\{|\Delta x(n)|\}$, which
leads to

$$\ln E\{|\Delta x(n)|\} = \ln C + H \ln \Delta \tag{12.6.54}$$

where C is a constant. The expectation in (12.6.54) is estimated by the mean absolute value,
and H is determined by linear regression. This approach is illustrated in Figure 12.28, which
shows the estimation of the self-similarity index H for two realizations of an FBM process
(for details see Problem 12.16). We note that in practice the range of scales extends from 1
to 0.1N, where N is the length of the used data record.
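
The regression based on (12.6.54) can be sketched as follows. This Python example estimates only H (not the variance requested in Problem 12.16), uses every integer lag from 1 to 0.1N, and fits an ordinary least-squares line; the function name and the `max_lag` argument are illustrative assumptions.

```python
import math

def est_h_mad(x, max_lag=None) -> float:
    """Estimate H from the slope of ln(mean |increment|) versus ln(lag)."""
    n = len(x)
    if max_lag is None:
        max_lag = max(2, n // 10)          # scales from 1 to 0.1 N
    log_lag, log_mad = [], []
    for lag in range(1, max_lag + 1):
        mad = sum(abs(x[i + lag] - x[i]) for i in range(n - lag)) / (n - lag)
        if mad > 0.0:
            log_lag.append(math.log(lag))
            log_mad.append(math.log(mad))
    m = len(log_lag)
    mean_x = sum(log_lag) / m
    mean_y = sum(log_mad) / m
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_lag, log_mad))
    sxx = sum((a - mean_x) ** 2 for a in log_lag)
    return sxy / sxx                       # slope of the regression line is H

ramp = [float(i) for i in range(200)]      # increments grow linearly with lag
print(est_h_mad(ramp))
```

A linear ramp is used as a deterministic sanity check because its mean absolute increment equals the lag, so the fitted slope should be 1; for a simulated FBM trace the estimate fluctuates around the true H.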

We have seen that for $|f| \to 0$, the PSD of FBM, FGN, and FPN follows a power
law $1/|f|^{\beta}$, where $\beta = 2H + 1$ (FBM), $\beta = 2H - 1$ (FGN), and $\beta = 2d = 2H - 1$ (FPN).
Therefore, another method for estimating the long-memory parameter H is to compute
an estimate of the PSD (see Chapter 5), and then determine H by linear regression of the
logarithm of the PSD on the logarithm of the frequency; the slope of the fitted line is $-\beta$.
In practice, we only use the lowest 10 percent of the PSD frequencies for the linear regression
because the power law relationship holds as $|f| \to 0$ (Taqqu et al. 1995). The PSD estimation
of power law processes using the multitaper PSD estimation method is discussed in McCoy
et al. (1998), which shows that using this method provides better estimates of long memory
than the traditionally used periodogram estimator.

In practice, data are scale-limited: The sampling interval determines the lowest scale, 
and the data record length determines the highest scale. Furthermore, the scaling behavior 
for a certain statistical moment may change from one range of scales to another. When 
we try to make predictions from an adoption of a scale-invariant model, there are certain 
discrepancies between theory and practice. In theory, the power increases with wavelength 
without limit, and the variance increases with profile length without limit. In practice, 
the power for long wavelengths is not as large as predicted by extrapolating the power law 
trend observed at short wavelengths (frequency domain), and the variance does not increase 
without bounds as the profile length increases (spatial domain). 


12.6.6 Fractional Lévy Stable Motion 


If we assume that the probability density function of the stationary increments is SαS, the
resulting self-similar process is known as fractional Lévy stable motion (FLSM). However,
unlike the FBM process, the second-order moments of the FLSM process do not exist
because SαS distributions have infinite variance. The realizations of FLSM resemble spikier
versions of FBM realizations because of the heavy tails of the stable distribution.
Hence, FLSM processes provide an excellent model for signals with long memory and high
variability.

FIGURE 12.28
Sample realizations of an FBM process and log-log plots for estimation of H using linear
regression.

Formally, an FLSM process $L_{H,\alpha}(t)$ is best formulated in terms of its increment
process $x_{H,\alpha}(n)$, known as fractional Lévy stable noise (FLSN). The FLSN is defined by the
stochastic integral (Samorodnitsky and Taqqu 1994)

$$x_{H,\alpha}(n) = L_{H,\alpha}(n+1) - L_{H,\alpha}(n) = C \int_{-\infty}^{n} \left[ (n+1-s)^{H-1/\alpha} - (n-s)^{H-1/\alpha} \right] w_{\alpha}(s)\, ds \tag{12.6.55}$$

where C is a constant, $\alpha$ is the characteristic exponent of the SαS distribution, and $w_{\alpha}(s)$ is
white noise from an SαS distribution. Notice that for $\alpha = 2$, Equation (12.6.55) provides
an integral description of FGN. From Figure 12.29, which shows several realizations of
FLSM for H = 0.7 and various values of $\alpha$, it is evident that the lower the value of $\alpha$, the
more impulsive the process becomes. The techniques described above for generating FBM
can be modified to simulate FLSM, by replacing the Gaussian random generator with the
SαS one described in Chambers et al. (1976) and Samorodnitsky and Taqqu (1994).
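
For illustration, a standardized SαS variate can be produced from one uniform and one exponential random number. The Python sketch below uses the transformation commonly attributed to Chambers et al. (1976) in its symmetric, unit-scale form; the function names are hypothetical, and the scale convention is an assumption (with this form, $\alpha = 2$ yields a Gaussian with variance 2 and $\alpha = 1$ yields a standard Cauchy variate).

```python
import math
import random

def sas_from_uniform_exp(alpha: float, u: float, w: float) -> float:
    """Map u in (-pi/2, pi/2) and w > 0 (unit exponential) to an SaS variate."""
    if alpha == 1.0:
        return math.tan(u)                              # Cauchy case
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def sas_noise(alpha: float, n: int, seed=None) -> list:
    """Generate n IID standardized SaS samples, 0 < alpha <= 2."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u = rng.uniform(-math.pi / 2.0, math.pi / 2.0)
        w = max(rng.expovariate(1.0), 1e-300)           # guard against w == 0
        out.append(sas_from_uniform_exp(alpha, u, w))
    return out

print(max(abs(v) for v in sas_noise(2.0, 1000, seed=0)))
print(max(abs(v) for v in sas_noise(1.2, 1000, seed=0)))
```

Replacing the Gaussian generator in the FBM simulation sketches with such a generator is the modification the text describes; any rescaling of the offsets for $\alpha < 2$ is not addressed here.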

The long-memory parameter H can be estimated by using (12.6.54) and linear regression.
The PSD method cannot be used because the second-order moments of the FLSM
process do not exist. The estimation of the characteristic exponent $\alpha$ of the SαS increments
is a very difficult task because (1) SαS distributions have infinite variance and (2) the
increments are not IID owing to the long-range dependence structure. Further discussion of
these topics, which are beyond the scope of this book, is provided in Adler et al. (1998),
McCulloch (1986), and Koutrouvelis (1980).

FIGURE 12.29
Sample realizations of the FLSM process for H = 0.7 and various values of $\alpha$. The spikes
increase as we go from a Gaussian ($\alpha = 2$) to a Cauchy ($\alpha = 1$) distribution.

Some interesting applications of FLSM to the modeling and interpolation of natural 
signals and images are discussed in Kogon and Manolakis (1994, 1996), Peng et al. (1993), 
Painter (1996), and Stuck and Kleiner (1974). 


12.7 SUMMARY 


In this chapter we introduced the basic concepts of three very important areas of statistical 
and adaptive signal processing that are the subject of extensive research. The goal was to help 
appreciate the limits of second-order statistical techniques, open a window to the exciting 
world of modern signal processing, and help the navigation through the ever-increasing 
literature. 

In Section 12.1 we introduced the basics of higher-order statistics and pointed out the 
situations in which their use may be beneficial. In general, the advantages of HOS become 
more evident as the non-Gaussianity and nonlinearity of the underlying models increase. 
Also HOS is of paramount importance when we deal with non-minimum-phase systems. 
Concise reviews of several aspects of HOS are given in Swami (1998) and Tugnait (1998), 
and a comprehensive bibliography is given in Swami et al. (1997). 

Section 12.2 provided a brief introduction to the principles of blind deconvolution and 
demonstrated that the blind deconvolution of non-minimum-phase systems requires the use 
of HOS. In Sections 12.3 and 12.4 we introduced the concept of unsupervised adaptive
filters, which operate without using a desired response signal; and we illustrated their ap- 
plication to both symbol-spaced and fractionally spaced blind equalization systems. A brief 
overview of current research in channel estimation and equalization is provided in Gian- 
nakis (1998). There are three types of unsupervised adaptive filtering algorithms: algorithms 
that use HOS either implicitly or explicitly, algorithms that use cyclostationary statistics, 
and algorithms that use information-theoretic concepts (Bell and Sejnowski 1995; Pham 
and Garrat 1997). We have focused on the widely used family of Bussgang-type algorithms 
that make implicit use of HOS. 

In the last part of this chapter, we provided an introduction to random signal models 
with long memory and low or high variability. More specifically, we discussed fractional 
pole models with Gaussian or SαS IID excitations and self-similar process models with
Gaussian (FBM, FGN) or SαS (FLSM, FLSN) stationary increments. The recent discovery
that Ethernet traffic data are self-similar and SαS (Willinger et al. 1994) established
long-memory models as a very useful tool in communication systems engineering. Finally, we
note that the wavelet transform, which decomposes a signal into a superposition of scaled 
and shifted versions of a single basis function known as the mother wavelet, provides a 
natural tool for the analysis of linear self-similar systems and self-similar random signals. 
The discrete wavelet transform facilitates, to a useful degree, the whitening of self-similar 
processes and can be used to synthesize various types of practical self-similar random 
signals (Mallat 1998; Wornell 1996). 


PROBLEMS 


12.1. Prove (12.1.27), which relates the output and input fourth-order cumulants of a linear, time- 
invariant system. 


12.2 (a) Derive (12.1.35) and (12.1.36). 
(b) Using the formulas for the cumulant of the sum of IID random variables, developed in 


Section 3.2.4, determine a and compare with the result obtained in (a). 


12.3 If x(n) is a stationary Gaussian process, show that $E\{x^{2}(n)\,x^{2}(n-l)\} = \rho_x^{2}(l)$ and explain
how it can be used to investigate the presence of nonlinearities.


12.4 In this problem we use an MA(2) model to explore some properties of cumulants and bispectra. 


(a) Write a MATLAB function k=cuma(b) that computes the cumulant $\kappa_x^{(3)}(l_1, l_2)$ of the
MA(2) model $H(z) = b_0 + b_1 z^{-1} + b_2 z^{-2}$ for $-L \le l_1, l_2 \le L$.

(b) Use the functions k=cuma(b), X=fft(x), and X=shiftfft(X) to compute the bispectra
of the three MA(2) models in Table 12.1. Plot your results and compare with those in
Figure 12.2.

(c) Compute the bispectra of the models using the formula

$$R_x^{(3)}(e^{j\omega_1}, e^{j\omega_2}) = \kappa_w^{(3)} H(e^{j\omega_1})\, H(e^{j\omega_2})\, H^{*}(e^{j(\omega_1+\omega_2)})$$

for $\omega_1, \omega_2 = 2\pi k/N$, $0 \le k \le N-1$. Compare with the results in part (b) and Figure 12.2.
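(d) Show that the bispectrum can be computed in MATLAB using the following segment of
code:

    H = freqz(h,1,N,'whole');    % N-point frequency response over [0, 2*pi)
    Hc = conj(H);
    % H(w1)*H(w2) from the outer product (.' is the nonconjugate transpose),
    % times conj(H(w1+w2)) arranged as a Hankel matrix with wraparound
    R3x = (H*H.') .* hankel(Hc,Hc([N,1:N-1]));
    R3x = shiftfft(R3x);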


12.5 Using the minimum-, maximum-, and mixed-phase systems discussed in Example 12.1.1, write
a MATLAB program to reproduce the results shown in Figures 12.3 and 12.4. Use a = 0.4,
b = 0.8, and N = 300 samples.


12.6 Use the Levinson-Durbin algorithm, developed in Chapter 7, to derive expressions (12.5.20),
for the direct-form coefficients, and (12.5.21), for the lattice parameters, of the fractional pole model.


12.7 Consider the FPZ(1, d, 0) model

$$H_{\mathrm{fpz}}(z) = \frac{1}{(1 - z^{-1})^{d}\,(1 + a z^{-1})}$$

where $-\tfrac{1}{2} < d < \tfrac{1}{2}$ and $-1 < a < 1$. Compute and plot the impulse response, autocorrelation,
and spectrum for $a = \pm 0.9$ and $d = \pm 0.2, \pm 0.4$. Identify which models have long memory
and which have short memory.


12.8 Compute and plot the PSD of the FGN process, using the following approaches, and compare
the results.

(a) The definition $R_x(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_x(l)\, e^{-j\omega l}$ and formula (12.6.36) for the
autocorrelation.
(b) The theoretical formula (12.6.37).


12.9 Use the algorithm of Schur to develop a more efficient implementation of the fractional pole
noise generation method described by Equations (12.5.24) to (12.5.28).


12.10 In this problem we study the properties of the harmonic fractional unit-pole model specified
by the system function given by (12.5.32). The impulse response is given by (Gray et al. 1989)

$$h_{\theta,d}(n) = \sum_{k=0}^{\lfloor n/2 \rfloor} \frac{(-1)^{k}\, \Gamma(d+n-k)\, (2\cos\theta)^{n-2k}}{k!\,(n-2k)!\,\Gamma(d)}$$

where $\Gamma(\cdot)$ is the gamma function.

(a) Compute and plot $h_{\theta,d}(n)$ for various values of $\theta$ and d.
(b) Demonstrate the validity of the above formula by evaluating $h_{\theta,d}(n)$ from $H_{\theta,d}(z)$ for the
same values of $\theta$ and d.
(c) Illustrate that the model is minimum-phase if $|\cos\theta| < 1$ and $-\tfrac{1}{2} < d < \tfrac{1}{2}$, or $\cos\theta = \pm 1$
and $-\tfrac{1}{4} < d < \tfrac{1}{4}$.
(d) Illustrate that the harmonic minimum-phase model, like the FPZ(0, d, 0) one, exhibits
long-memory behavior only for positive values of d.
(e) Show that for $0 < d < \tfrac{1}{4}$ and $\cos\theta = 1$, the autocorrelation equals that of the FPZ(0, 2d,
0) model [multiplied by $(-1)^{l}$ if $\cos\theta = -1$]. When $|\cos\theta| < 1$ and $0 < d < \tfrac{1}{2}$, illustrate
numerically that the autocorrelation can be approximated by $\rho(l) \sim -l^{2d-1} \sin(\theta l - \pi d)$
as $l \to \infty$.
(f) Compute and plot the spectrum of the model for $\theta = \pi/3$ and various values of d.
(g) Generate and plot realizations of Gaussian HFPZ noise for $\theta = \pi/6$ and d = -0.3, 0.1,
and 0.4.


12.11 Determine the variogram of the process x(n) obtained by exciting the system

$$H(z) = \frac{1}{(1 - z^{-1})(1 - a z^{-1})} \qquad |a| < 1$$

with white noise $w(n) \sim \mathrm{WGN}(0, \sigma_w^2)$.


12.12 Following the steps leading to (12.6.26), show that the fractal (Hausdorff) dimension D is
related to the Hurst exponent H by

$$D = 2 - H$$


12.13 Develop a MATLAB function to generate the ordinary Brownian motion trace according to the
steps given for the cumulative-sum method in Section 12.6.4. The format of the function should
be x = obm_cumsum(N).

(a) Generate 16,384 samples of the Brownian motion x(t) over $0 \le t \le 1$.
(b) Investigate the self-affine property of x(t) by reproducing a figure similar to Figure 12.23.

12.14 Develop a MATLAB function to generate the fractional Brownian motion trace according to
the steps given for the spectral synthesis method in Section 12.6.4. The format of the function
should be x = fbm_spectral(H,N).

(a) Generate 1024 samples of the FBM $B_H(t)$ over $0 \le t \le 1$ for H = 0.3. Investigate the
self-affine property of $B_{0.3}(t)$.

(b) Generate 1024 samples of the FBM $B_H(t)$ over $0 \le t \le 1$ for H = 0.7. Investigate the
self-affine property of $B_{0.7}(t)$.


12.15 Develop a MATLAB function to generate the fractional Brownian motion trace according to the
steps given for the random midpoint replacement method in Section 12.6.4. The format of the
function should be x = fbm_replace(N).

(a) Generate 1024 samples of the FBM $B_H(t)$ over $0 \le t \le 1$ for H = 0.5. Compare
visually $B_{0.5}(t)$ with that obtained by using the cumulative-sum method. Comment on
your observations.

(b) Generate 1024 samples of the FBM $B_H(t)$ over $0 \le t \le 1$ for H = 0.99. Investigate the
artifact discussed in the chapter for H → 1.


12.16 Based on Equation (12.6.54), develop a MATLAB function [H,sigmaH] = est_H_mad(x)
that computes an estimate of the self-similarity index H and the variance $\sigma_H^2$ of an FBM
process.

(a) Use function x = fbm_replace(N) to generate N = 1024 samples of an FBM process
with H = 0.3, and use the function [H,sigmaH] = est_H_mad(x) to estimate H and
$\sigma_H$.

(b) Repeat the previous task for H = 0.7.

(c) Perform a Monte Carlo simulation using 100 trials and compute the mean and standard
deviation of the estimates for H and $\sigma_H$ in (a) and (b).


12.17 Repeat Problem 12.16 by developing a function that estimates the self-similarity index H by
determining the slope of the first 10 percent values of the periodogram in a log-log plot.


APPENDIX A 


Matrix Inversion Lemma 


The matrix inversion lemma is a useful formula that is employed extensively in signal 
processing. The purpose of this formula is to express the inverse of a matrix in terms of the 
inverse of one of its additive components, so as to facilitate an efficient computation of the 
inverse. To motivate this lemma, consider the inverse of the following scalar quantity 


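$$(a + xy)^{-1} = \frac{1}{a + xy} \qquad a + xy \ne 0,\ a \ne 0$$

in terms of the inverse of a. Since $a + xy \ne 0$ and $a \ne 0$, we also have

$$x y a^{-1} \ne -1 \quad \text{and} \quad y a^{-1} x \ne -1 \tag{A.1}$$

Using the convergence of the geometric series formula

$$1 - x y a^{-1} + (x y a^{-1})^{2} - \cdots = \frac{1}{1 + x y a^{-1}} \qquad |x y a^{-1}| < 1 \tag{A.2}$$

we obtain

$$\begin{aligned}
\frac{1}{a + xy} &= \frac{a^{-1}}{1 + x y a^{-1}} = a^{-1}\left[ 1 - x y a^{-1} + (x y a^{-1})^{2} - \cdots \right] \\
&= a^{-1} - a^{-1} x y a^{-1} + a^{-1} x (y a^{-1} x) y a^{-1} - a^{-1} x (y a^{-1} x)^{2} y a^{-1} + \cdots \\
&= a^{-1} - a^{-1} x y a^{-1} \left[ 1 - y a^{-1} x + (y a^{-1} x)^{2} - \cdots \right] \\
&= a^{-1} - \frac{a^{-1} x y a^{-1}}{1 + y a^{-1} x} \qquad y a^{-1} x \ne -1
\end{aligned} \tag{A.3}$$

which is the desired result. We begin with a special case of the lemma in which a is a matrix
and x and y are vectors. This result then can be generalized to the case in which x and y
are also matrices.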


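LEMMA A.1 (SHERMAN-MORRISON'S FORMULA). Let $\mathbf{A}$ be an $N \times N$ invertible matrix
and let $\mathbf{x}$ and $\mathbf{y}$ be two $N \times 1$ vectors such that $(\mathbf{A} + \mathbf{x}\mathbf{y}^H)$ is invertible. Then we have

$$(\mathbf{A} + \mathbf{x}\mathbf{y}^H)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1}\mathbf{x}\mathbf{y}^H\mathbf{A}^{-1}}{1 + \mathbf{y}^H\mathbf{A}^{-1}\mathbf{x}} \tag{A.4}$$

Proof. Consider

$$\mathbf{A} + \mathbf{x}\mathbf{y}^H = \mathbf{A}(\mathbf{I} + \mathbf{A}^{-1}\mathbf{x}\mathbf{y}^H)$$

Hence

$$(\mathbf{A} + \mathbf{x}\mathbf{y}^H)^{-1} = (\mathbf{I} + \mathbf{A}^{-1}\mathbf{x}\mathbf{y}^H)^{-1}\mathbf{A}^{-1} \tag{A.5}$$

Using the matrix version of the geometric series (A.2), we have

$$(\mathbf{I} + \mathbf{A}^{-1}\mathbf{x}\mathbf{y}^H)^{-1} = \mathbf{I} - \mathbf{A}^{-1}\mathbf{x}\mathbf{y}^H + \mathbf{A}^{-1}\mathbf{x}\mathbf{y}^H\mathbf{A}^{-1}\mathbf{x}\mathbf{y}^H - \cdots \tag{A.6}$$

since from (A.5) $(\mathbf{I} + \mathbf{A}^{-1}\mathbf{x}\mathbf{y}^H)$ is invertible. Substituting (A.6) into (A.5), we obtain

$$\begin{aligned}
(\mathbf{A} + \mathbf{x}\mathbf{y}^H)^{-1} &= \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{x}\mathbf{y}^H\mathbf{A}^{-1} + \mathbf{A}^{-1}\mathbf{x}\underbrace{(\mathbf{y}^H\mathbf{A}^{-1}\mathbf{x})}_{\text{scalar}}\mathbf{y}^H\mathbf{A}^{-1} - \cdots \\
&= \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{x}\mathbf{y}^H\mathbf{A}^{-1}\left[1 - \mathbf{y}^H\mathbf{A}^{-1}\mathbf{x} + (\mathbf{y}^H\mathbf{A}^{-1}\mathbf{x})^2 - \cdots\right] \\
&= \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1}\mathbf{x}\mathbf{y}^H\mathbf{A}^{-1}}{1 + \mathbf{y}^H\mathbf{A}^{-1}\mathbf{x}}
\end{aligned}$$

since the scalar $\mathbf{y}^H\mathbf{A}^{-1}\mathbf{x} \neq -1$ due to the invertibility of $(\mathbf{I} + \mathbf{A}^{-1}\mathbf{x}\mathbf{y}^H)$ [see also (A.1)]. This
completes the proof.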
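The formula (A.4) can be checked numerically on a small case. The sketch below is only an illustration: it assumes real-valued data (so that $\mathbf{y}^H = \mathbf{y}^T$), uses a 2 x 2 matrix so that the direct inverse is easy to write out, and the helper names `inv2` and `sherman_morrison` are hypothetical, not functions from the book toolbox.

```python
# Numerical check of the Sherman-Morrison formula (A.4) on a real 2 x 2 example.
# The helper names below are illustrative only.

def inv2(M):
    """Direct inverse of a 2 x 2 matrix stored as a list of rows."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def sherman_morrison(A_inv, x, y):
    """Return (A + x y^T)^{-1} from A^{-1} and real vectors x, y, as in (A.4)."""
    n = len(x)
    Ax = [sum(A_inv[i][k] * x[k] for k in range(n)) for i in range(n)]  # A^{-1} x
    yA = [sum(y[k] * A_inv[k][j] for k in range(n)) for j in range(n)]  # y^T A^{-1}
    denom = 1.0 + sum(y[k] * Ax[k] for k in range(n))                   # 1 + y^T A^{-1} x
    return [[A_inv[i][j] - Ax[i] * yA[j] / denom for j in range(n)]
            for i in range(n)]

A = [[4.0, 1.0], [2.0, 3.0]]
x = [1.0, 2.0]
y = [0.5, -1.0]

updated = sherman_morrison(inv2(A), x, y)
direct = inv2([[A[i][j] + x[i] * y[j] for j in range(2)] for i in range(2)])
max_err = max(abs(updated[i][j] - direct[i][j]) for i in range(2) for j in range(2))
print(max_err)  # difference between the two inverses, at rounding level
```

The point of the formula is that the rank-one update reuses $\mathbf{A}^{-1}$ and needs only matrix-vector products, rather than a fresh inversion.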


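The generalization of (A.4), known as Woodbury's formula, is given by

$$(\mathbf{A} + \mathbf{B}\mathbf{C}\mathbf{D})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{D}\mathbf{A}^{-1} \tag{A.7}$$

If matrix $\mathbf{A}$ is partitioned as

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} \tag{A.8}$$

then (A.7) can be used in determining inverses of submatrices contained in

$$\mathbf{A}^{-1} = \begin{bmatrix}
(\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})^{-1} & -(\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})^{-1}\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\
-(\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12})^{-1}\mathbf{A}_{21}\mathbf{A}_{11}^{-1} & (\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12})^{-1}
\end{bmatrix} \tag{A.9}$$

where inverses $\mathbf{A}_{11}^{-1}$ and $\mathbf{A}_{22}^{-1}$ are assumed to exist.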


APPENDIX B 


Gradients and Optimization 
in Complex Space 


In the development of many signal processing algorithms, it is necessary to compute the 
gradient of a real or complex function with respect to a complex vector w. The concepts 
involved in this gradient operation and the application of the gradient in optimization are 
described in this section. For more details see Gill et al. (1981), Kay (1993), and Luenberger 
(1984). 


B.1 GRADIENT 


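We begin with the simplest case. Let $g(\mathbf{x})$ be a real scalar function of a real parameter vector
$\mathbf{x}$. Then we define the gradient of $g(\mathbf{x})$ with respect to vector $\mathbf{x}$ as a column vector

$$\nabla_{\mathbf{x}}(g) \triangleq \frac{\partial g(\mathbf{x})}{\partial \mathbf{x}} = \left[\frac{\partial g(\mathbf{x})}{\partial x_1}\ \ \frac{\partial g(\mathbf{x})}{\partial x_2}\ \ \cdots\ \ \frac{\partial g(\mathbf{x})}{\partial x_N}\right]^T \tag{B.1}$$

This definition extends to a vector function $\mathbf{g}(\mathbf{x})$ of parameter vector $\mathbf{x}$ as

$$\nabla_{\mathbf{x}}(\mathbf{g}) \triangleq \frac{\partial \mathbf{g}^T(\mathbf{x})}{\partial \mathbf{x}} = \left[\frac{\partial g_1(\mathbf{x})}{\partial \mathbf{x}}\ \ \frac{\partial g_2(\mathbf{x})}{\partial \mathbf{x}}\ \ \cdots\ \ \frac{\partial g_M(\mathbf{x})}{\partial \mathbf{x}}\right] = \begin{bmatrix}
\dfrac{\partial g_1(\mathbf{x})}{\partial x_1} & \dfrac{\partial g_2(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial g_M(\mathbf{x})}{\partial x_1} \\
\dfrac{\partial g_1(\mathbf{x})}{\partial x_2} & \dfrac{\partial g_2(\mathbf{x})}{\partial x_2} & \cdots & \dfrac{\partial g_M(\mathbf{x})}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial g_1(\mathbf{x})}{\partial x_N} & \dfrac{\partial g_2(\mathbf{x})}{\partial x_N} & \cdots & \dfrac{\partial g_M(\mathbf{x})}{\partial x_N}
\end{bmatrix} \tag{B.2}$$

Thus $\nabla_{\mathbf{x}}(\mathbf{g})$ is an $N \times M$ matrix. Finally, consider a scalar function $g(\mathbf{A})$ of an $M \times N$
matrix $\mathbf{A}$. We define the gradient of $g(\mathbf{A})$ with respect to $\mathbf{A}$ as a matrix

$$\nabla_{\mathbf{A}}(g) \triangleq \frac{\partial g(\mathbf{A})}{\partial \mathbf{A}} = \begin{bmatrix}
\dfrac{\partial g(\mathbf{A})}{\partial a_{11}} & \dfrac{\partial g(\mathbf{A})}{\partial a_{12}} & \cdots & \dfrac{\partial g(\mathbf{A})}{\partial a_{1N}} \\
\dfrac{\partial g(\mathbf{A})}{\partial a_{21}} & \dfrac{\partial g(\mathbf{A})}{\partial a_{22}} & \cdots & \dfrac{\partial g(\mathbf{A})}{\partial a_{2N}} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial g(\mathbf{A})}{\partial a_{M1}} & \dfrac{\partial g(\mathbf{A})}{\partial a_{M2}} & \cdots & \dfrac{\partial g(\mathbf{A})}{\partial a_{MN}}
\end{bmatrix} \tag{B.3}$$

Using these definitions, it is easy to prove the following results:

$$\nabla_{\mathbf{x}}(\mathbf{y}^T\mathbf{A}\mathbf{x}) = \mathbf{A}^T\mathbf{y} \tag{B.4}$$
$$\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{y}) = \mathbf{A}\mathbf{y} \tag{B.5}$$
$$\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x} \tag{B.6}$$
$$\nabla_{\mathbf{A}}(\mathbf{x}^T\mathbf{A}\mathbf{y}) = \mathbf{x}\mathbf{y}^T \tag{B.7}$$
$$\nabla_{\mathbf{A}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = \mathbf{x}\mathbf{x}^T \tag{B.8}$$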


Now we consider the case of a complex-valued scalar function g(z, z*) of a complex 
variable z and its complex conjugate z*. We assume that the function is analytic with respect 
to z and z* independently¹ (in the sense of partial differentiation). An example of such a
function is 


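$$g(z, z^*) = a|z|^2 + bz^* + c = azz^* + bz^* + c \tag{B.9}$$

Let $f(x, y)$ be the complex function of the real and imaginary parts $x$ and $y$ of the variable
$z = x + jy$, such that $g(z, z^*) = f(x, y)$. Again consider the function in (B.9); then

$$f(x, y) = a(x^2 + y^2) + b(x - jy) + c \tag{B.10}$$
$$= a|z|^2 + bz^* + c = g(z, z^*) \tag{B.11}$$

The partial derivative of $g(z, z^*)$ with respect to $z$ (keeping $z^*$ as a constant) is given by

$$\frac{\partial}{\partial z} g(z, z^*) = \frac{1}{2}\left[\frac{\partial}{\partial x} f(x, y) - j\frac{\partial}{\partial y} f(x, y)\right] \tag{B.12}$$

Similarly, the partial derivative of $g(z, z^*)$ with respect to $z^*$ (keeping $z$ as a constant) is
given by

$$\frac{\partial}{\partial z^*} g(z, z^*) = \frac{1}{2}\left[\frac{\partial}{\partial x} f(x, y) + j\frac{\partial}{\partial y} f(x, y)\right] \tag{B.13}$$

These results can be easily verified for $g(z, z^*)$ in (B.9):

$$\frac{\partial}{\partial z}\left[a|z|^2 + bz^* + c\right] = \frac{\partial}{\partial z}\left[azz^* + bz^* + c\right] = az^* = a(x - jy)$$

and

$$\begin{aligned}
\frac{1}{2}\left[\frac{\partial}{\partial x} f(x, y) - j\frac{\partial}{\partial y} f(x, y)\right]
&= \frac{1}{2}\left\{\frac{\partial}{\partial x}\left[a(x^2 + y^2) + b(x - jy) + c\right] - j\frac{\partial}{\partial y}\left[a(x^2 + y^2) + b(x - jy) + c\right]\right\} \\
&= ax + \frac{b}{2} - jay - \frac{b}{2} = a(x - jy) = az^*
\end{aligned}$$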
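The two derivative rules (B.12) and (B.13) can also be checked by finite differences. The following sketch is an illustration under stated assumptions: the constants `a`, `b`, `c` and the test point `z` are arbitrary choices, and the expected value $\partial g/\partial z^* = az + b$ follows from differentiating $azz^* + bz^* + c$ with $z$ held constant.

```python
# Finite-difference check of (B.12) and (B.13) for g(z, z*) = a|z|^2 + b z* + c.
a, b, c = 2.0, 1.0 - 3.0j, 0.5   # arbitrary illustrative constants

def f(x, y):
    """g expressed through the real and imaginary parts of z, as in (B.10)."""
    return a * (x * x + y * y) + b * (x - 1j * y) + c

def wirtinger(x, y, h=1e-6):
    """Central-difference estimates of dg/dz (B.12) and dg/dz* (B.13)."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return 0.5 * (df_dx - 1j * df_dy), 0.5 * (df_dx + 1j * df_dy)

z = 0.7 - 1.2j
dg_dz, dg_dzconj = wirtinger(z.real, z.imag)
err_z = abs(dg_dz - a * z.conjugate())     # expected value a z*
err_zconj = abs(dg_dzconj - (a * z + b))   # expected value a z + b
print(err_z, err_zconj)
```

Both errors should be at the level of finite-difference rounding, which supports treating $z$ and $z^*$ as formally independent variables.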
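Let $f(\mathbf{x})$ be a real-valued scalar function of the complex vector $\mathbf{x}$ expressed as

$$f(\mathbf{x}) = g(\mathbf{x}, \mathbf{x}^*) \tag{B.14}$$

where $g(\cdot)$ is a real-valued function of $\mathbf{x}$ and $\mathbf{x}^*$, analytic with respect to $\mathbf{x}$ and $\mathbf{x}^*$
independently (in the sense of partial differentiation). The necessary and sufficient condition to
obtain an equilibrium (optimum) point of $f(\mathbf{x})$ is that

$$\nabla_{\mathbf{x}}(g) = \nabla_{\mathbf{x}^*}(g) = \mathbf{0} \tag{B.15}$$

The necessary gradient $\nabla_{\mathbf{x}^*}(g)$ can be computed by using (B.13). In particular, for any
complex vectors $\mathbf{x}$ and $\mathbf{y}$ and matrix $\mathbf{A}$, we have

$$\nabla_{\mathbf{x}^*}(\mathbf{x}^H\mathbf{y}) = \mathbf{y} \tag{B.16}$$
$$\nabla_{\mathbf{x}^*}(\mathbf{y}^H\mathbf{x}) = \mathbf{0} \tag{B.17}$$
$$\nabla_{\mathbf{x}^*}(\mathbf{x}^H\mathbf{A}\mathbf{y}) = \mathbf{A}\mathbf{y} \tag{B.18}$$
$$\nabla_{\mathbf{x}^*}(\mathbf{x}^H\mathbf{A}\mathbf{x}) = \mathbf{A}\mathbf{x} \tag{B.19}$$
$$\nabla_{\mathbf{x}}(\mathbf{x}^H\mathbf{A}\mathbf{x}) = \mathbf{x}^H\mathbf{A} \tag{B.20}$$
$$\nabla_{\mathbf{A}}(\mathbf{x}^H\mathbf{A}\mathbf{y}) = \mathbf{x}^*\mathbf{y}^T \tag{B.21}$$
$$\nabla_{\mathbf{A}}(\mathbf{x}^H\mathbf{A}\mathbf{x}) = \mathbf{x}^*\mathbf{x}^T \tag{B.22}$$

¹ In this approach, the quantities $z$ and $z^*$ are considered to be independent of each other. Clearly they are not,
since $z$ is uniquely determined by its conjugate. Nevertheless, this technique works.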


B.2 LAGRANGE MULTIPLIERS


The procedure of using Lagrange multipliers is an elegant technique of obtaining optimum 
values of a function of several variables subject to one or more constraints. Suppose we 


want to determine the minimum of a function $f(\mathbf{x})$ of $N$ variables $\mathbf{x} = [x_1, \ldots, x_N]^T$, subject
to a constraint relating $x_1$ through $x_N$ given in the form

$$g(\mathbf{x}) = 0 \tag{B.23}$$

One straightforward approach would be to solve (B.23) for one of the variables, say $x_1$, in
terms of the remaining ones and then eliminate $x_1$ from $f(\mathbf{x})$. The minimization of $f(\mathbf{x})$ can
then be carried out in a usual way to determine the minimum point in the N-dimensional
space. In practice, this approach is all but impossible to carry out, especially if f(x) is 
highly nonlinear. 

A simpler yet elegant approach is to introduce an additional parameter $\lambda$, called a
Lagrange multiplier.† To motivate this technique through a geometric viewpoint, consider
a two-dimensional function

$$f(x_1, x_2) = x_1^2 + x_2^2 \tag{B.24}$$

which is a bowl-shaped surface whose minimum is at the origin $x_1 = x_2 = 0$. Thus
minimizing $f(\mathbf{x})$ is the same as minimizing the length of vector $\mathbf{x}$. If there is no constraint,
the zero vector is the best $\mathbf{x}$. Now let the constraint be a line

$$x_2 = -\tfrac{1}{2}x_1 + \tfrac{5}{2} \tag{B.25}$$

in the $(x_1, x_2)$ plane. Thus

$$g(\mathbf{x}) = x_1 + 2x_2 - 5 = 0 \tag{B.26}$$

This constraint and the bowl-shaped surface are shown in Figure B.1. The constraint plane
cuts through the bowl, creating a parabolic edge, as shown in the figure. Since the point $\mathbf{x}$ is
restricted to the constraint line (B.26), the minimization function $f(\mathbf{x})$ is constrained to the
parabolic edge. Thus the minimization of (B.24) becomes a problem of finding the point
on the parabolic curve that is nearest to the origin. This is also the point on the constraint
line that is nearest to the origin and is obtained by drawing a perpendicular ray, as shown
in Figure B.1. This point is $x_1 = 1$ and $x_2 = 2$. At this point the parabolic edge achieves its
minimum.

How is all this related to the Lagrange multiplier? Referring to Figure B.1, we see that at
any point P on the constraint surface, the gradient of $f(\mathbf{x})$ is given by vector $\nabla f$. To find
the minimum point of $f(\mathbf{x})$ within the constraint surface, we have to find the component
$\nabla_{\parallel} f$ of $\nabla f$ that lies in the surface and to set it equal to zero, that is,

$$\nabla_{\parallel} f = \mathbf{0} \tag{B.27}$$


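Consider the constraint function $g(\mathbf{x})$ and perturb $\mathbf{x}$ to $\mathbf{x} + \delta\mathbf{x}$ within the surface. Then using
the Taylor expansion, we can write

$$g(\mathbf{x} + \delta\mathbf{x}) = g(\mathbf{x}) + \delta\mathbf{x}^T\nabla g(\mathbf{x}) = g(\mathbf{x}) \tag{B.28}$$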


since $\mathbf{x} + \delta\mathbf{x}$ is chosen to lie within the surface $g(\mathbf{x}) = 0$. This implies that $\delta\mathbf{x}^T\nabla g(\mathbf{x}) = 0$,
which means that the gradient $\nabla g(\mathbf{x})$ is normal to the constraint surface. As shown in
Figure B.1, we can now obtain the component $\nabla_{\parallel} f$ by adding a suitably scaled vector
$\nabla g(\mathbf{x})$ to the gradient in the form

$$\nabla_{\parallel} f = \nabla f + \lambda\nabla g(\mathbf{x}) \tag{B.29}$$

† Although we have reserved $\lambda$ for eigenvalues, we will follow the tradition and use $\lambda$ also as a Lagrange multiplier.

FIGURE B.1
Geometric interpretation of Lagrange multiplier. [Figure not reproduced; labels: $f(x_1, x_2)$, constraint plane, parabolic edge, axes $x_1$ and $x_2$.]


where $\lambda$ is a Lagrange multiplier. Using linearity of the gradient operator, we introduce the
Lagrangian function

$$\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda g(\mathbf{x}) \tag{B.30}$$

so that the gradient $\nabla\mathcal{L}$ is given by (B.29).

Therefore, to find the minimum of $f(\mathbf{x})$ subject to $g(\mathbf{x}) = 0$, we first define the
Lagrangian (B.30) and then find the minimum point of $\mathcal{L}(\mathbf{x}, \lambda)$ by differentiating it with
respect to both $\mathbf{x}$ and $\lambda$. This results in $N + 1$ equations that can be solved to determine the
optimum $\mathbf{x}_o$ and $\lambda_o$, from which the minimum $f(\mathbf{x}_o)$ can be found. Note that $\partial\mathcal{L}/\partial\lambda = 0$
leads to the constraint $g(\mathbf{x}) = 0$. Thus the Lagrange multiplier technique leads to the equations
for a constrained minimum, and it does not require us to solve $g(\mathbf{x}) = 0$ for one of the variables.
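As a short worked illustration (a sketch added here, not part of the original derivation), the procedure can be applied to the geometric example (B.24) with the constraint (B.26). Setting the three partial derivatives of the Lagrangian to zero recovers the point found geometrically above:

```latex
\begin{aligned}
\mathcal{L}(x_1, x_2, \lambda) &= x_1^2 + x_2^2 + \lambda\,(x_1 + 2x_2 - 5) \\
\frac{\partial \mathcal{L}}{\partial x_1} &= 2x_1 + \lambda = 0
  \;\Rightarrow\; x_1 = -\tfrac{1}{2}\lambda \\
\frac{\partial \mathcal{L}}{\partial x_2} &= 2x_2 + 2\lambda = 0
  \;\Rightarrow\; x_2 = -\lambda \\
\frac{\partial \mathcal{L}}{\partial \lambda} &= x_1 + 2x_2 - 5 = 0
  \;\Rightarrow\; -\tfrac{1}{2}\lambda - 2\lambda = 5
  \;\Rightarrow\; \lambda_o = -2 \\
x_{1,o} &= 1, \qquad x_{2,o} = 2, \qquad f(\mathbf{x}_o) = 1^2 + 2^2 = 5
\end{aligned}
```

The result agrees with the point $x_1 = 1$, $x_2 = 2$ obtained from the perpendicular-ray argument.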

This technique can be extended to more than one, say $K$, constraints simply by using
one Lagrange multiplier $\lambda_k$ for each of the constraints $g_k(\mathbf{x}) = 0$, $k = 1, \ldots, K$, and
constructing a Lagrangian function of the form

$$\mathcal{L}(\mathbf{x}, \lambda_1, \ldots, \lambda_K) = f(\mathbf{x}) + \sum_{k=1}^{K}\lambda_k g_k(\mathbf{x}) \tag{B.31}$$

This Lagrangian is then minimized with respect to $\mathbf{x}$ and $\{\lambda_k\}_{k=1}^{K}$.


EXAMPLE B.1. Consider the problem of fitting the largest (areawise) rectangle inside an ellipse
given by

$$\frac{x_1^2}{a^2} + \frac{x_2^2}{b^2} = 1 \tag{B.32}$$

The ellipse and an inscribed rectangle are shown in Figure B.2. Thus the objective function that
we want to maximize is

$$f(x_1, x_2) = (2x_1)(2x_2) = 4x_1x_2 \tag{B.33}$$

subject to the constraint

$$g(x_1, x_2) = \frac{x_1^2}{a^2} + \frac{x_2^2}{b^2} - 1 \tag{B.34}$$

FIGURE B.2
Ellipse and the inscribed rectangle in Example B.1. [Figure not reproduced; axes $x_1$ and $x_2$.]


Method 1. Solving (B.34) for $x_2$, we obtain

$$x_2 = \pm\frac{b}{a}\sqrt{a^2 - x_1^2} \tag{B.35}$$

Since the area is positive, choosing the plus sign and substituting in (B.33), we have

$$f(x_1, x_2) = 4\frac{b}{a}x_1\sqrt{a^2 - x_1^2} \tag{B.36}$$

which is a function of $x_1$ alone. Now to obtain the maximum value of $f(x_1, x_2)$, we set

$$\frac{df}{dx_1} = 0 = 4\frac{b}{a}\left[\sqrt{a^2 - x_1^2} - \frac{x_1^2}{\sqrt{a^2 - x_1^2}}\right] \tag{B.37}$$

Thus from (B.37) we get the optimum value of $x_1$ and subsequently from (B.35) the optimum
value for $x_2$

$$x_{1,o} = \frac{a}{\sqrt{2}} \qquad x_{2,o} = \frac{b}{\sqrt{2}} \tag{B.38}$$

Method 2. Let us form the Lagrangian

$$\mathcal{L}(x_1, x_2, \lambda) = f(x_1, x_2) + \lambda g(x_1, x_2) = 4x_1x_2 + \lambda\left(\frac{x_1^2}{a^2} + \frac{x_2^2}{b^2} - 1\right) \tag{B.39}$$

Now to find the optimum point, we set

$$\frac{\partial\mathcal{L}}{\partial x_1} = 0 = 4x_2 + \lambda\frac{2x_1}{a^2} \tag{B.40}$$
$$\frac{\partial\mathcal{L}}{\partial x_2} = 0 = 4x_1 + \lambda\frac{2x_2}{b^2} \tag{B.41}$$
$$\frac{\partial\mathcal{L}}{\partial\lambda} = 0 = \frac{x_1^2}{a^2} + \frac{x_2^2}{b^2} - 1 \tag{B.42}$$

Solving (B.40) through (B.42), we obtain the optimum values

$$x_{1,o} = \frac{a}{\sqrt{2}} \qquad x_{2,o} = \frac{b}{\sqrt{2}} \qquad \lambda_o = -2ab \tag{B.43}$$

Clearly, the second method is more convenient.
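The closed-form optimum of Example B.1 can be cross-checked by brute force. The sketch below is only a sanity check under assumed values ($a = 3$, $b = 2$ and the grid resolution are arbitrary choices): it evaluates the area (B.36) on a grid of $x_1$ values and compares the best grid point with $a/\sqrt{2}$ and the best area with $4x_{1,o}x_{2,o} = 2ab$.

```python
import math

def area(x1, a, b):
    """Rectangle area 4*x1*x2 with x2 taken from the ellipse, as in (B.36)."""
    x2 = (b / a) * math.sqrt(a * a - x1 * x1)
    return 4.0 * x1 * x2

a, b = 3.0, 2.0                                   # illustrative semi-axes
grid = [a * k / 100000 for k in range(1, 100000)] # x1 values strictly inside (0, a)
best_x1 = max(grid, key=lambda x1: area(x1, a, b))

print(best_x1, a / math.sqrt(2))        # grid maximizer vs. closed form a/sqrt(2)
print(area(best_x1, a, b), 2 * a * b)   # best area vs. closed form 2ab
```

The grid search agrees with (B.38) to within the grid spacing, which is all such a check can show.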


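EXAMPLE B.2. Let a real-valued random vector $\mathbf{y}$ be given by

$$\mathbf{y} = \alpha\mathbf{x} + \mathbf{v} \tag{B.44}$$

where $\mathbf{x}$ is a deterministic vector, $\alpha$ is a constant, and $\mathbf{v}$ is a zero-mean random vector with
covariance matrix $\mathbf{R}_v$. We want to determine a best linear unbiased estimator (BLUE) of $\alpha$,
given $\mathbf{y}$. Let

$$\hat{\alpha} = \mathbf{h}^T\mathbf{y} \tag{B.45}$$

Since the estimator must be unbiased, we have

$$\alpha = E\{\hat{\alpha}\} = E\{\mathbf{h}^T\mathbf{y}\} = E\{\mathbf{h}^T(\alpha\mathbf{x} + \mathbf{v})\} = \alpha E\{\mathbf{h}^T\mathbf{x}\} = \alpha\mathbf{h}^T\mathbf{x} \tag{B.46}$$

which implies that $\mathbf{h}^T\mathbf{x} = 1$. Hence the constraint $g(\mathbf{h})$ is

$$g(\mathbf{h}) = \mathbf{h}^T\mathbf{x} - 1 \tag{B.47}$$

Next we want to minimize the variance in the estimation

$$\operatorname{var}(\hat{\alpha}) = \operatorname{var}(\mathbf{h}^T\mathbf{y}) = \operatorname{var}(\mathbf{h}^T\mathbf{v}) = \mathbf{h}^T\mathbf{R}_v\mathbf{h} \tag{B.48}$$

Now to obtain the BLUE of $\alpha$, consider the Lagrangian

$$\mathcal{L}(\mathbf{h}, \lambda) = \mathbf{h}^T\mathbf{R}_v\mathbf{h} + \lambda(\mathbf{h}^T\mathbf{x} - 1) \tag{B.49}$$

Using (B.5) and (B.6), we obtain

$$\nabla_{\mathbf{h}}(\mathcal{L}) = 2\mathbf{h}^T\mathbf{R}_v + \lambda\mathbf{x}^T = \mathbf{0}^T \tag{B.50}$$

or

$$\mathbf{h}_o = -\frac{\lambda}{2}\mathbf{R}_v^{-1}\mathbf{x} \tag{B.51}$$

Substituting (B.51) into (B.47) and solving for $\lambda$, we obtain

$$\lambda = -\frac{2}{\mathbf{x}^T\mathbf{R}_v^{-1}\mathbf{x}}$$

Finally, the optimum estimator becomes

$$\mathbf{h}_o = \frac{\mathbf{R}_v^{-1}\mathbf{x}}{\mathbf{x}^T\mathbf{R}_v^{-1}\mathbf{x}} \tag{B.52}$$

which can be recognized as a whitening filter and a matched filter.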
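A small numerical instance may help to see what (B.52) delivers. The following is a minimal sketch under assumed data: the 2 x 2 covariance `R_v`, the vector `x`, and the helper names are illustrative choices, not values from the text. It computes the weights, confirms the unbiasedness constraint (B.47), and compares the variance (B.48) with that of another weight vector that also satisfies the constraint.

```python
def inv2(M):
    """Direct inverse of a 2 x 2 matrix stored as a list of rows."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(h, R):
    """Quadratic form h^T R h, the estimator variance in (B.48)."""
    return sum(h[i] * R[i][j] * h[j] for i in range(2) for j in range(2))

def blue_weights(R_v, x):
    """Weights of (B.52): R_v^{-1} x / (x^T R_v^{-1} x)."""
    R_inv = inv2(R_v)
    w = [sum(R_inv[i][k] * x[k] for k in range(2)) for i in range(2)]  # R_v^{-1} x
    scale = sum(x[i] * w[i] for i in range(2))                         # x^T R_v^{-1} x
    return [wi / scale for wi in w]

R_v = [[2.0, 0.5], [0.5, 1.0]]   # assumed noise covariance (symmetric, positive definite)
x = [1.0, 2.0]                   # assumed deterministic vector

h = blue_weights(R_v, x)
unbiased = sum(h[i] * x[i] for i in range(2))   # h^T x, should equal 1
var_opt = quad(h, R_v)

# Any h + t*d with d orthogonal to x still satisfies h^T x = 1 but has larger variance.
d = [2.0, -1.0]
var_other = quad([h[i] + 0.1 * d[i] for i in range(2)], R_v)
print(unbiased, var_opt, var_other)
```

The comparison vector is only one feasible alternative; the general optimality claim rests on the Lagrangian argument above, not on this check.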


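EXAMPLE B.3. Consider a complex-valued case of the above example. We want to minimize

$$f(\mathbf{h}) = \mathbf{h}^H\mathbf{R}_v\mathbf{h} \tag{B.53}$$

where $\mathbf{R}_v$ is a real-valued symmetric matrix so that $f(\mathbf{h})$ is real, subject to

$$\operatorname{Re}\{\mathbf{h}^H\mathbf{x}\} = b \tag{B.54}$$

Consider $f(\mathbf{h})$ and the constraint function $g(\mathbf{h})$ as

$$f(\mathbf{h}, \mathbf{h}^*) = \mathbf{h}^H\mathbf{R}_v\mathbf{h}$$
$$g(\mathbf{h}, \mathbf{h}^*) = \mathbf{h}^H\mathbf{x} + \mathbf{x}^H\mathbf{h} - 2b \tag{B.55}$$

Thus the Lagrangian is

$$\mathcal{L}(\mathbf{h}, \mathbf{h}^*, \lambda) = \mathbf{h}^H\mathbf{R}_v\mathbf{h} - \lambda(\mathbf{h}^H\mathbf{x} + \mathbf{x}^H\mathbf{h} - 2b) \tag{B.56}$$

Now using (B.20), we get

$$\nabla_{\mathbf{h}}(\mathcal{L}) = \mathbf{h}^H\mathbf{R}_v - \lambda\mathbf{x}^H = \mathbf{0}^T \;\Rightarrow\; \mathbf{h}_o = \lambda\mathbf{R}_v^{-1}\mathbf{x} \tag{B.57}$$

From the constraint (B.55)

$$\lambda(\mathbf{x}^H\mathbf{R}_v^{-1}\mathbf{x}) = b$$

which gives

$$\mathbf{h}_o = \frac{b\,\mathbf{R}_v^{-1}\mathbf{x}}{\mathbf{x}^H\mathbf{R}_v^{-1}\mathbf{x}} \tag{B.58}$$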


APPENDIX C 


MATLAB Functions 


In this appendix, we provide a brief one-line description of MATLAB functions that were 
referred to in this book. The source of each function is given in parentheses where detailed 
information can be found. Page numbers for functions explicitly discussed in the text are 
also given. 


TABLE C.1 
MATLAB functions. 


Function Description Page 
a2r Direct parameters to autocorrelation conversion 367 
aplatest Estimation of all-pole lattice parameters (Book toolbox) 460 
arls AR model estimation using the LS criterion without windowing (Book toolbox) 451
armals ARMA model estimation using the LS criterion without windowing (Book toolbox) 466
arwin AR model estimation using the LS criterion without windowing (Book toolbox) 451
autoc Computation of autocovariance sequence (Book toolbox) 210 
autocfft Computation of autocovariance sequence using the FFT (Book toolbox) 210 
bartlett Computation of Bartlett window coefficients (MATLAB) 230 
boxcar Computation of rectangular window coefficients (MATLAB) 206 
bt_psd Blackman-Tukey power spectral density computation (Book toolbox) 227 
chebwin Computation of Chebyshev window coefficients (MATLAB) 206 
chol Computation of Cholesky decomposition (MATLAB) 278 
cohere Coherence function estimation (MATLAB SP toolbox) 241 
conv Convolution sum computation (MATLAB) 48 
corr Computation of cross-correlation sequence (MATLAB) 

csd Cross-spectral density computation (MATLAB SP toolbox) 240 
cumsum Cumulative-sum computation (MATLAB) 

df2latcf Direct-form to lattice-form conversion (Book toolbox) 67 
df2ldrf Direct-form to lattice/ladder-form conversion (Book toolbox) 

dpss Discrete prolate spheroidal sequence window coefficient computation (MATLAB SP toolbox) 248 
dtfgn Generation of discrete fractional Gaussian noise (Book toolbox) 721 
durbin Implementation of Durbin algorithm (Book toolbox) 358 
eig Computes eigenvalues and eigenvectors of a matrix (MATLAB) 

esprit_ls Least-squares ESPRIT for frequency estimation (Book toolbox) 493 
esprit_tls Total least-squares ESPRIT for frequency estimation (Book toolbox) 493 
ev_method Eigenvector method for frequency estimation (Book toolbox) 488 
faest FAEST RLS algorithm (Book toolbox) 576 
filter Direct-form-II filter implementation (MATLAB) 50 
filtic Computation of direct-form-II filter initial conditions (MATLAB SP toolbox) 50 
firlms FIR LMS adaptive filtering algorithm (Book toolbox) 526 
hamming Computation of Hamming window coefficients (MATLAB) 206 
hanning Computation of Hann window coefficients (MATLAB) 206
invschur Implementation of inverse Schur algorithm (Book toolbox) 375
invtoepl Computation of $\mathbf{R}^{-1}$ when $\mathbf{R}$ is Toeplitz (Book toolbox) 378
k2r Lattice parameters to autocorrelation sequence conversion (Book toolbox) 367
kaiser Computation of Kaiser window coefficients (MATLAB) 208
ladrfilt Lattice/ladder filter implementation (Book toolbox)
latcf2df Lattice to direct-form conversion (Book toolbox) 67
latcfilt Lattice filter implementation (Book toolbox) 68
ldlt Computes the LDU decomposition (Book toolbox) 277
ldltchol Computes $\mathbf{LDL}^H$ using chol 278
ldrf2df Lattice/ladder to direct-form conversion (Book toolbox)
lduneqs Solution of normal equations using LDU decomposition (Book toolbox) 277
levins Implementation of Levinson's algorithm (Book toolbox) 359
lsigest Computation of LS signal estimators (Book toolbox) 288
lsmatvec Computation of $\mathbf{R}$ and $\mathbf{d}$ for FIR LS filtering (Book toolbox) 408
lu LU decomposition (MATLAB)
mgs Implementation of modified GS algorithm (Book toolbox) 430
minnorm Minimum-norm method for frequency estimation (Book toolbox) 488 
music MUSIC frequency estimation (Book toolbox) 485 
phd Pisarenko harmonic decomposition (Book toolbox) 484 
pmtm Power spectrum estimation via Thomson multitaper method (MATLAB SP toolbox) 248 
psd Power spectrum estimation via Welch’s method (MATLAB SP toolbox) 213, 232 
pzls Pole-zero coefficient estimation using the LS criterion (Book toolbox) 463 
qr Computation of QR decomposition (MATLAB) 424 
rand Generates pseudorandom numbers that are uniformly distributed over (0, 1) (MATLAB) 83 
randn Generates $N(0, 1)$ pseudorandom numbers (MATLAB) 83
rls Implementation of conventional RLS algorithm (Book toolbox) 
rootmusic Root-MUSIC frequency estimation (Book toolbox) 485 
schurlg Schur algorithm (Book toolbox) 370
stablepdf Computes pdf plots of stable distributions numerically (Book toolbox) 95 
stepdown Lattice-form to direct-form conversion in Levinson algorithm (Book toolbox) 366 
stepup Direct-form to lattice-form conversion in Levinson algorithm (Book toolbox) 366 
svd Computation of SVD (MATLAB) 436 
tfe Transfer function estimation (MATLAB SP toolbox) 243 
toeplitz Toeplitz matrix from first row and column (MATLAB) 48 
triang Computation of triangular window coefficients (MATLAB) 
udut Computation of $\mathbf{UDU}^H$ decomposition (Book toolbox)


APPENDIX D 


Useful Results from Matrix Algebra 


In this appendix, we review the fundamental concepts of linear algebra in complex-valued 
space. The aim is to present as many possible concepts as are necessary to understand the 
book. For a complete treatment, refer to many excellent references in literature including 
Leon (1998), Strang (1980), and Gill et al. (1981). 


D.1. COMPLEX-VALUED VECTOR SPACE 


The unitary complex space $\mathbb{C}^N$ is defined as the space of all the N-dimensional complex-valued
vectors, which are denoted by a boldface letter, or by the N-tuple of its components,
for example,

$$\mathbf{x} = [x_1\ x_2\ \cdots\ x_N]^T = [x_1^*\ x_2^*\ \cdots\ x_N^*]^H \tag{D.1.1}$$

where we use the following notation for the superscripts: $T$ means transpose, $*$ means
conjugate, and $H$ means conjugate (of the) transpose, or adjoint. In the case of real-valued
vectors, the real space is denoted by $\mathbb{R}^N$ and is also known as the Euclidean space.


Some Definitions 


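1. The inner product between two vectors $\mathbf{x}$ and $\mathbf{y}$ is defined by

$$\langle\mathbf{x}, \mathbf{y}\rangle \triangleq \mathbf{x}^H\mathbf{y} = \sum_{i=1}^{N} x_i^* y_i \tag{D.1.2}$$

2. Two vectors $\mathbf{x}$ and $\mathbf{y}$ are orthogonal if their inner product is zero, that is,

$$\mathbf{x}^H\mathbf{y} = 0 \tag{D.1.3}$$

The zero vector $\mathbf{0}$ is orthogonal to any vector in the same space.

3. The norm of a vector provides a measure of the "size" of a vector. It is a nonnegative
number $\|\mathbf{x}\|$ that satisfies the following properties:

a. $\|\mathbf{x}\| > 0$ for $\mathbf{x} \neq \mathbf{0}$ and $\|\mathbf{0}\| = 0$.
b. $\|a\mathbf{x}\| = |a|\,\|\mathbf{x}\|$ for any complex number $a$.
c. $\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|$ (triangle inequality).

The $p$ norm of $\mathbf{x}$ is defined as

$$\|\mathbf{x}\|_p \triangleq \left(\sum_{i=1}^{N}|x_i|^p\right)^{1/p} \tag{D.1.4}$$

which satisfies all three properties given above. For $p = 2$, we obtain the Euclidean
norm $\|\mathbf{x}\|_2$ which, for simplicity, is denoted by $\|\mathbf{x}\|$. It is defined as

$$\|\mathbf{x}\| = \sqrt{\mathbf{x}^H\mathbf{x}} = \sqrt{\sum_{i=1}^{N}|x_i|^2} \tag{D.1.5}$$

4. An orthonormalized set is a set of $L$ vectors $\mathbf{x}_l$, $l = 1, 2, \ldots, L$, such that

$$\mathbf{x}_l^H\mathbf{x}_k = \begin{cases} 1 & l = k \\ 0 & l \neq k \end{cases} \tag{D.1.6}$$

5. Cauchy-Schwarz inequality: Two vectors $\mathbf{x}$ and $\mathbf{y}$ belonging to the same space satisfy

$$|\mathbf{x}^H\mathbf{y}| \leq \|\mathbf{x}\|\,\|\mathbf{y}\| \tag{D.1.7}$$

where the equality applies when $\mathbf{x} = a\mathbf{y}$, with $a$ being a (real- or complex-valued) scalar.

6. The angle $\theta$ between two vectors is defined as

$$\cos\theta = \frac{\mathbf{x}^H\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|} \tag{D.1.8}$$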


D.2 MATRICES


A rectangular array of $N \times M$ complex numbers ordered in $N$ rows and $M$ columns is
called a matrix and is denoted by capital boldface letters, for example,

$$\mathbf{A} = [a_{ik}] \qquad 1 \leq i \leq N,\ 1 \leq k \leq M \tag{D.2.1}$$

Any linear transformation from space $\mathbb{C}^N$ into space $\mathbb{C}^M$ can be represented by a suitable
$N \times M$ matrix, if two bases in $\mathbb{C}^N$ and $\mathbb{C}^M$ are already defined. Linear transformations from
space $\mathbb{C}^N$ into space $\mathbb{C}^N$ are given by square $N \times N$ nonsingular matrices, in which case
the transformation can be considered as a change of basis. We consider square matrices for
the following development.


D.2.1 Some Definitions 


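1. A system of linearly independent vectors $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_N$ in a complex space $\mathbb{C}^N$ is called
a basis for $\mathbb{C}^N$ if it is possible to express any vector $\mathbf{x} \in \mathbb{C}^N$ by means of $N$ coefficients
$a_1, a_2, \ldots, a_N$ as

$$\mathbf{x} = a_1\mathbf{e}_1 + a_2\mathbf{e}_2 + \cdots + a_N\mathbf{e}_N = \sum_{i=1}^{N} a_i\mathbf{e}_i \tag{D.2.2}$$

If a vector has the components $x_1, x_2, \ldots, x_N$ in a given basis, then the linearly
transformed vector $\mathbf{y}$ has components

$$\begin{aligned}
y_1 &= a_{11}x_1 + \cdots + a_{1N}x_N \\
&\ \ \vdots \\
y_N &= a_{N1}x_1 + \cdots + a_{NN}x_N
\end{aligned} \tag{D.2.3}$$

in the basis defined by the transformation

$$\mathbf{A} = \begin{bmatrix} a_{11} & \cdots & a_{1N} \\ a_{21} & \cdots & a_{2N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{bmatrix} \tag{D.2.4}$$

This transformation can be expressed, using the well-known row-by-column product
between matrices and vectors, as

$$\mathbf{y} = \mathbf{A}\mathbf{x} \tag{D.2.5}$$

2. The transformation from $\mathbf{y}$ to $\mathbf{x}$ is called an inverse transformation, which is again linear.
It is written as

$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{y} \tag{D.2.6}$$

where $\mathbf{A}^{-1}$ (if it exists) is a matrix and is known as an inverse of $\mathbf{A}$, defined in (D.3.5).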


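3. The transformation that leaves unchanged the vector basis is said to be the identity
transformation, and the related matrix is indicated generally by $\mathbf{I}$, which is given by

$$\mathbf{I} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \tag{D.2.7}$$

4. Two linear transformations of $\mathbb{C}^N$ into itself can be applied to a vector, obtaining a third
transformation, called the product transformation

$$\mathbf{y} = \mathbf{A}\mathbf{x} \qquad \mathbf{z} = \mathbf{B}\mathbf{y} = \mathbf{B}(\mathbf{A}\mathbf{x}) = (\mathbf{B}\mathbf{A})\mathbf{x} \;\Rightarrow\; \mathbf{z} = \mathbf{C}\mathbf{x} \tag{D.2.8}$$

where the matrix $\mathbf{C}$ is the product of $\mathbf{B}$ and $\mathbf{A}$. In general, the matrix product is not
commutative, that is, $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$.

5. The operation of transposition of a matrix inverts the orders of rows and columns; that
is, element $a_{ki}$ takes the place of $a_{ik}$ in the new matrix. Similarly, the conjugate transpose
of a matrix $\mathbf{A}$ is a matrix in which element $a_{ki}^*$ takes the place of $a_{ik}$. The operations of
conjugation and transposition are commutative, that is,

$$\mathbf{A}^H = (\mathbf{A}^*)^T = (\mathbf{A}^T)^* \tag{D.2.9}$$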


6. A matrix norm $\|\mathbf{A}\|$ satisfies the following properties:

a. $\|\mathbf{A}\| > 0$ for $\mathbf{A} \neq \mathbf{0}$ and $\|\mathbf{0}\| = 0$.
b. $\|a\mathbf{A}\| = |a|\,\|\mathbf{A}\|$ for any complex number $a$.
c. $\|\mathbf{A} + \mathbf{B}\| \leq \|\mathbf{A}\| + \|\mathbf{B}\|$ (triangle inequality).
d. $\|\mathbf{A}\mathbf{B}\| \leq \|\mathbf{A}\|\,\|\mathbf{B}\|$, which is needed because the matrix multiplication operation
creates new matrices.

An important matrix norm is the Frobenius norm, defined as

$$\|\mathbf{A}\|_F \triangleq \sqrt{\sum_{i=1}^{N}\sum_{k=1}^{N}|a_{ik}|^2} \tag{D.2.10}$$

which treats the matrix as a "long vector." Using any vector $p$ norm, we can obtain the
matrix norm

$$\|\mathbf{A}\|_p \triangleq \max_{\mathbf{x} \neq \mathbf{0}}\frac{\|\mathbf{A}\mathbf{x}\|_p}{\|\mathbf{x}\|_p} \tag{D.2.11}$$

which measures the amplification power of matrix $\mathbf{A}$. The matrix norm for $p = 2$ is
known as the spectral norm and is of great theoretical significance, and it is simply
denoted by $\|\mathbf{A}\|$. When a matrix acts upon a vector $\mathbf{x}$ of length $\|\mathbf{x}\|_p$, it transforms $\mathbf{x}$ into
vector $\mathbf{A}\mathbf{x}$ of length $\|\mathbf{A}\mathbf{x}\|_p$. The ratio $\|\mathbf{A}\mathbf{x}\|_p/\|\mathbf{x}\|_p$ provides the magnification factor of
the linear transformation $\mathbf{A}\mathbf{x}$. The number $\|\mathbf{A}\|_p$ is the maximum magnification caused
by $\mathbf{A}$. Similarly, the minimum magnification due to $\mathbf{A}$ is given by

$$\min|\mathbf{A}|_p \triangleq \min_{\mathbf{x} \neq \mathbf{0}}\frac{\|\mathbf{A}\mathbf{x}\|_p}{\|\mathbf{x}\|_p} \tag{D.2.12}$$

and the ratio $\|\mathbf{A}\|_p/\min|\mathbf{A}|_p$ characterizes the dynamic range of the linear transformation
performed by matrix $\mathbf{A}$. This interpretation provides a nice geometric picture for the
concept of condition number (see Section D.3.2).


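7. A matrix $\mathbf{A}$ is called Hermitian if

$$\mathbf{A}^H = \mathbf{A} \tag{D.2.13}$$

and a Hermitian form $H(\mathbf{x}, \mathbf{x})$ is the second-order real homogeneous polynomial

$$H(\mathbf{x}, \mathbf{x}) \triangleq \sum_{i=1}^{N}\sum_{k=1}^{N} h_{ik}x_i^*x_k \qquad h_{ik} = h_{ki}^* \tag{D.2.14}$$

8. A real-valued matrix $\mathbf{A}$ is called symmetric if $\mathbf{A}^T = \mathbf{A}$, and a quadratic form $Q(\mathbf{x}, \mathbf{x})$ is
the second-order real homogeneous polynomial

$$Q(\mathbf{x}, \mathbf{x}) \triangleq \sum_{i=1}^{N}\sum_{k=1}^{N} h_{ik}x_ix_k \qquad h_{ik} = h_{ki} \tag{D.2.15}$$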

9. Matrix $\mathbf{L}$ is called a lower triangular matrix if all elements above the principal diagonal
are zero. Similarly, matrix $\mathbf{U}$ is called an upper triangular matrix if all elements below
the principal diagonal are zero.

10. The trace of a matrix is the sum of the elements of its principal diagonal, that is,

$$\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^{N} a_{ii} \tag{D.2.16}$$

with the property

$$\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{A}) = \operatorname{tr}(\mathbf{A}^T\mathbf{B}^T) \tag{D.2.17}$$

for any square matrices $\mathbf{A}$ and $\mathbf{B}$.


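11. A diagonal matrix is a square $N \times N$ matrix with $a_{ij} = 0$ for $i \neq j$; that is, all elements
off the principal diagonal are zero. It appears as

$$\mathbf{A} = \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{NN} \end{bmatrix} \tag{D.2.18}$$

12. A Toeplitz matrix is defined as

$$\mathbf{A} = [a_{ik}] = [a_{i-k}] \qquad 1 \leq i \leq N,\ 1 \leq k \leq M \tag{D.2.19}$$

A square Toeplitz matrix appears as

$$\mathbf{A} = \begin{bmatrix}
a_0 & a_{-1} & a_{-2} & \cdots & a_{1-N} \\
a_1 & a_0 & a_{-1} & \cdots & a_{2-N} \\
a_2 & a_1 & a_0 & \cdots & a_{3-N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{N-1} & a_{N-2} & a_{N-3} & \cdots & a_0
\end{bmatrix} \tag{D.2.20}$$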


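13. A matrix is called persymmetric if it is symmetric about the cross-diagonal, that is,
$a_{ij} = a_{N-j+1,\,N-i+1}$, $1 \leq i \leq N$, $1 \leq j \leq N$.

14. The exchange matrix $\mathbf{J}$ is defined by

$$\mathbf{J} = \begin{bmatrix} 0 & \cdots & 0 & 1 \\ 0 & \cdots & 1 & 0 \\ \vdots & & \vdots & \vdots \\ 1 & \cdots & 0 & 0 \end{bmatrix}$$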

and has the following properties 


eA 

JA = flipud(A)

AJ = fliplr(A)
where the MATLAB functions flipud(A) and fliplr(A) reverse the order of rows and 
columns of a matrix A, respectively. 

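15. A matrix is called centrosymmetric if it is both symmetric and persymmetric. It can be
easily seen that a centrosymmetric matrix has the property $\mathbf{J}^T\mathbf{A}\mathbf{J} = \mathbf{A}$ when $\mathbf{A}$ is real or
$\mathbf{J}^T\mathbf{A}\mathbf{J} = \mathbf{A}^*$ when $\mathbf{A}$ is complex.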

16. A matrix is called Hankel if the elements along the secondary diagonals, that is, the 
diagonals that are perpendicular to the main diagonal, are equal. If A is Hankel, then JA 
is Toeplitz. 

17. The inverse of a triangular, symmetric, Hermitian, persymmetric and centrosymmetric 
matrix has the same structure. The inverse of a Toeplitz matrix is persymmetric and the 
inverse of a Hankel matrix is symmetric. 

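18. A partition of an $N \times M$ matrix $\mathbf{A}$ is a notational rearrangement in terms of its submatrices.
For example, a $2 \times 2$ partitioning of $\mathbf{A}$ is

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} \tag{D.2.21}$$

where each "element" $\mathbf{A}_{ij}$ is a submatrix of $\mathbf{A}$.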
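The structural definitions in items 12 through 14 are easy to exercise on a small case. The sketch below uses plain Python lists rather than MATLAB, and the helper names (`matmul`, `exchange`, `toeplitz`) are hypothetical stand-ins. It builds a 3 x 3 Toeplitz matrix, confirms that premultiplying and postmultiplying by the exchange matrix reverse the rows and the columns, and checks the persymmetry condition of item 13 in the equivalent form $\mathbf{J}\mathbf{A}^T\mathbf{J} = \mathbf{A}$.

```python
def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def exchange(n):
    """n x n exchange matrix: ones on the cross-diagonal, zeros elsewhere."""
    return [[1 if i + j == n - 1 else 0 for j in range(n)] for i in range(n)]

def toeplitz(first_col, first_row):
    """Square Toeplitz matrix whose (i, k) entry depends only on i - k."""
    n = len(first_col)
    return [[first_col[i - k] if i >= k else first_row[k - i] for k in range(n)]
            for i in range(n)]

A = toeplitz([1, 2, 3], [1, 4, 5])     # [[1, 4, 5], [2, 1, 4], [3, 2, 1]]
J = exchange(3)
A_T = [list(col) for col in zip(*A)]   # transpose of A

rows_reversed = matmul(J, A) == A[::-1]                    # JA reverses the rows
cols_reversed = matmul(A, J) == [row[::-1] for row in A]   # AJ reverses the columns
persymmetric = matmul(matmul(J, A_T), J) == A              # a_ij = a_{N-j+1, N-i+1}
print(rows_reversed, cols_reversed, persymmetric)
```

Integer entries are used so that the comparisons are exact; with floating-point data a tolerance would be needed.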


D.2.2 Properties of Square Matrices 


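1. The operations of transposition $T$, conjugation $*$, or both $H$ are distributive, that is,

$$\begin{aligned}
(\mathbf{A} + \mathbf{B})^T &= \mathbf{A}^T + \mathbf{B}^T \\
(\mathbf{A} + \mathbf{B})^* &= \mathbf{A}^* + \mathbf{B}^* \\
(\mathbf{A} + \mathbf{B})^H &= \mathbf{A}^H + \mathbf{B}^H
\end{aligned} \tag{D.2.22}$$

2. For the operators $T$, $H$, or $-1$ (inversion), we have

$$\begin{aligned}
(\mathbf{A}\mathbf{B})^T &= \mathbf{B}^T\mathbf{A}^T \\
(\mathbf{A}\mathbf{B})^H &= \mathbf{B}^H\mathbf{A}^H \\
(\mathbf{A}\mathbf{B})^{-1} &= \mathbf{B}^{-1}\mathbf{A}^{-1}
\end{aligned} \tag{D.2.23}$$

3. The operators $*$, $T$, $H$, and $-1$ are commutative, for example,

$$(\mathbf{A}^H)^{-1} = (\mathbf{A}^{-1})^H \tag{D.2.24}$$

Thus we can use the compact notation $\mathbf{A}^{-H}$ or $\mathbf{A}^{-*}$, etc.

4. Given any matrix $\mathbf{A}$, matrix $\mathbf{B} = \mathbf{A}^H\mathbf{A}$ is Hermitian [see (D.2.13)], and if $\mathbf{A}$ is invertible,
then for such a $\mathbf{B}$, we have

$$\mathbf{A}^{-H}\mathbf{B}\mathbf{A}^{-1} = \mathbf{A}^{-H}\mathbf{A}^H\mathbf{A}\mathbf{A}^{-1} = \mathbf{I} \tag{D.2.25}$$

5. If $\mathbf{H}$ is the matrix of the coefficients $h_{ik}$, the Hermitian form (D.2.14) can be written as

$$H(\mathbf{x}, \mathbf{x}) = \mathbf{x}^H\mathbf{H}\mathbf{x} = \langle\mathbf{x}, \mathbf{H}\mathbf{x}\rangle \tag{D.2.26}$$

Similarly, if $\mathbf{H}$ is the real-valued matrix of the coefficients $h_{ik}$, the quadratic form (D.2.15)
can be written as

$$Q(\mathbf{x}, \mathbf{x}) = \mathbf{x}^T\mathbf{H}\mathbf{x} = \langle\mathbf{x}, \mathbf{H}\mathbf{x}\rangle \tag{D.2.27}$$

6. A Hermitian matrix $\mathbf{A}$ is called

$$\begin{aligned}
&\text{Positive definite if } \mathbf{x}^H\mathbf{A}\mathbf{x} > 0 \\
&\text{Positive semidefinite if } \mathbf{x}^H\mathbf{A}\mathbf{x} \geq 0 \text{ (also nonnegative definite)} \\
&\text{Negative definite if } \mathbf{x}^H\mathbf{A}\mathbf{x} < 0 \\
&\text{Negative semidefinite if } \mathbf{x}^H\mathbf{A}\mathbf{x} \leq 0 \text{ (also nonpositive definite)}
\end{aligned} \tag{D.2.28}$$

for all $\mathbf{x} \neq \mathbf{0}$.

7. The operation of the trace of a matrix satisfies

$$\operatorname{tr}(\mathbf{A} + \mathbf{B}) = \operatorname{tr}(\mathbf{A}) + \operatorname{tr}(\mathbf{B}) \tag{D.2.29}$$
$$\operatorname{tr}(k\mathbf{A}) = k\operatorname{tr}(\mathbf{A}) \tag{D.2.30}$$
$$\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{A}) \tag{D.2.31}$$
$$\operatorname{tr}(\mathbf{B}^{-1}\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{A}) \tag{D.2.32}$$
$$\operatorname{tr}(\mathbf{A}\mathbf{A}^H) = \sum_{i=1}^{N}\sum_{j=1}^{N}|a_{ij}|^2 \tag{D.2.33}$$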
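D.3 DETERMINANT OF A SQUARE MATRIX


The determinant of a square matrix $\mathbf{A}$ is denoted by

$$\det(\mathbf{A}) \triangleq \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{vmatrix} \tag{D.3.1}$$

and is equal to the sum of the products of the elements of any row or column and their
respective cofactors, that is,

$$\det(\mathbf{A}) = a_{i1}C_{i1} + a_{i2}C_{i2} + \cdots + a_{iN}C_{iN} \tag{D.3.2}$$

or

$$\det(\mathbf{A}) = a_{1k}C_{1k} + a_{2k}C_{2k} + \cdots + a_{Nk}C_{Nk} \tag{D.3.3}$$

where the $C_{ik}$ are called cofactors, given by

$$C_{ik} = (-1)^{i+k}\det(\mathbf{A}_{ik}) \tag{D.3.4}$$

where $\mathbf{A}_{ik}$ is an $(N - 1)$st-order square matrix obtained by deleting the $i$th row and $k$th
column. Thus the determinant needs to be computed recursively; that is, the $N$th-order
determinant is computed from the $(N - 1)$st-order determinant, which in turn is computed
from the $(N - 2)$nd-order, and so on. If $\det(\mathbf{A}) \neq 0$, then the inverse $\mathbf{A}^{-1}$ of $\mathbf{A}$ exists and
is unique. The $\mathbf{A}^{-1}$ matrix is given by

$$\mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})}\begin{bmatrix} C_{11} & C_{21} & \cdots & C_{N1} \\ C_{12} & C_{22} & \cdots & C_{N2} \\ \vdots & \vdots & \ddots & \vdots \\ C_{1N} & C_{2N} & \cdots & C_{NN} \end{bmatrix} \tag{D.3.5}$$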

D.3.1 Properties of the Determinant 


Below we provide some useful properties of the determinant. 


1. If a row (or column) of a matrix is a linear combination of other rows (or columns), then
$\det(\mathbf{A}) = 0$. In particular, if (a) a row (or column) is proportional or equal to another
row (or column) or (b) a row (or column) is identically zero, then $\det(\mathbf{A}) = 0$.

2. If two rows (or columns) are exchanged with each other, then the determinant changes
its sign.

3. For a triangular matrix (upper or lower) $\mathbf{A}$, the determinant is obtained by multiplying
all the elements of its principal diagonal, that is,

$$\det(\mathbf{A}) = \prod_{n=1}^{N} a_{nn} \tag{D.3.6}$$

4. The $\det(\mathbf{A})$ is unchanged if $\mathbf{A}$ is replaced by its transpose $\mathbf{A}^T$; that is,

$$\det(\mathbf{A}) = \det(\mathbf{A}^T) \tag{D.3.7}$$

5. Using the above property, we also claim that the determinant of a Hermitian matrix is
real, since

$$\det(\mathbf{A}) = \det(\mathbf{A}^H) = \det[(\mathbf{A}^*)^T] \;\Rightarrow\; \det(\mathbf{A}) = \det(\mathbf{A}^*) = [\det(\mathbf{A})]^* \tag{D.3.8}$$

6. The determinant of a product of matrices is the product of their determinants; that is,

$$\det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A})\det(\mathbf{B}) \tag{D.3.9}$$

7. If matrix $\mathbf{A}$ is nonsingular, that is, its inverse $\mathbf{A}^{-1}$ exists, then

$$\det(\mathbf{A}^{-1}) = [\det(\mathbf{A})]^{-1} = \frac{1}{\det(\mathbf{A})} \tag{D.3.10}$$

8. Given an arbitrary constant $c$ (possibly complex-valued), we have

$$\det(c\mathbf{A}) = c^N\det(\mathbf{A}) \tag{D.3.11}$$


D.3.2 Condition Number


One of the important equations in signal processing is the linear equation $\mathbf{R}\mathbf{c} = \mathbf{d}$,
where $\mathbf{R}$ is a matrix of known values, $\mathbf{d}$ is a vector of known quantities, and $\mathbf{c}$ is a vector
of unknown coefficients. The investigation of how the solution of $\mathbf{R}\mathbf{c} = \mathbf{d}$ is affected by
small changes (perturbations) in the elements of $\mathbf{R}$ and $\mathbf{d}$ leads to an important characteristic
number of matrix $\mathbf{R}$, called the condition number.


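If vector $\mathbf{d}$ is perturbed to $\mathbf{d} + \delta\mathbf{d}$, the exact solution $\mathbf{c}$ is perturbed to $\mathbf{c} + \delta\mathbf{c}$. Therefore,

$$\mathbf{R}(\mathbf{c} + \delta\mathbf{c}) = \mathbf{d} + \delta\mathbf{d} \tag{D.3.12}$$

which implies that $\delta\mathbf{c} = \mathbf{R}^{-1}\delta\mathbf{d}$, since $\mathbf{R}\mathbf{c} = \mathbf{d}$,
or using property 4 of matrix norm

$$\|\delta\mathbf{c}\| \leq \|\mathbf{R}^{-1}\|\,\|\delta\mathbf{d}\| \tag{D.3.13}$$

From the same norm property and $\mathbf{d} = \mathbf{R}\mathbf{c}$, we obtain

$$\|\mathbf{d}\| \leq \|\mathbf{R}\|\,\|\mathbf{c}\| \tag{D.3.14}$$

Multiplying (D.3.13) by (D.3.14) and solving, we obtain

$$\frac{\|\delta\mathbf{c}\|}{\|\mathbf{c}\|} \leq \|\mathbf{R}\|\,\|\mathbf{R}^{-1}\|\,\frac{\|\delta\mathbf{d}\|}{\|\mathbf{d}\|} \tag{D.3.15}$$

Similarly, keeping $\mathbf{d}$ constant and perturbing $\mathbf{R}$ to $\mathbf{R} + \delta\mathbf{R}$, we have

$$(\mathbf{R} + \delta\mathbf{R})(\mathbf{c} + \delta\mathbf{c}) = \mathbf{d}$$

from which, after ignoring the second-order product term $\delta\mathbf{R}\,\delta\mathbf{c}$, we obtain

$$\frac{\|\delta\mathbf{c}\|}{\|\mathbf{c}\|} \leq \|\mathbf{R}\|\,\|\mathbf{R}^{-1}\|\,\frac{\|\delta\mathbf{R}\|}{\|\mathbf{R}\|} \tag{D.3.16}$$

A careful inspection of (D.3.15) and (D.3.16) shows that the relative error in the exact
solution is bounded by the number

$$\operatorname{cond}(\mathbf{R}) \triangleq \|\mathbf{R}\|\,\|\mathbf{R}^{-1}\| \tag{D.3.17}$$

which is known as the condition number of matrix $\mathbf{R}$, multiplied by the relative perturbation
in the data ($\mathbf{R}$ or $\mathbf{d}$). When relatively small perturbations in $\mathbf{R}$ cause relatively small (large)
perturbations in the solution of $\mathbf{R}\mathbf{c} = \mathbf{d}$, matrix $\mathbf{R}$ is said to be well (ill) conditioned.
Clearly, ill-conditioned matrices have large condition numbers, and therefore their large
magnification power amplifies small perturbations to the extent that makes the obtained
solution totally inaccurate.

Since the norm of the identity matrix $\|\mathbf{I}\| = 1$, we have

$$\|\mathbf{I}\| = \|\mathbf{R}\mathbf{R}^{-1}\| \leq \|\mathbf{R}\|\,\|\mathbf{R}^{-1}\| = \operatorname{cond}(\mathbf{R})$$

that is,

$$\operatorname{cond}(\mathbf{R}) \geq 1 \tag{D.3.18}$$

The best possible condition number is 1.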
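The bound (D.3.15) can be seen at work on a nearly singular 2 x 2 system. The sketch below is an illustration under assumptions: the matrix, the right-hand side, and the perturbation are arbitrary choices, and the infinity norm (maximum absolute row sum for matrices, maximum absolute entry for vectors) is used instead of the spectral norm because it is an induced norm that is simple to compute by hand.

```python
def inv2(M):
    """Direct inverse of a 2 x 2 matrix stored as a list of rows."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [sum(M[i][k] * v[k] for k in range(len(v))) for i in range(len(M))]

def norm_vec(v):
    """Infinity norm of a vector."""
    return max(abs(t) for t in v)

def norm_mat(M):
    """Induced infinity norm of a matrix: maximum absolute row sum."""
    return max(sum(abs(t) for t in row) for row in M)

R = [[1.0, 1.0], [1.0, 1.0001]]   # nearly singular, hence ill conditioned
d = [2.0, 2.0001]                 # exact solution is close to c = [1, 1]
delta_d = [0.0, 1e-4]             # small perturbation of the data

R_inv = inv2(R)
c = matvec(R_inv, d)
delta_c = matvec(R_inv, delta_d)

cond = norm_mat(R) * norm_mat(R_inv)             # (D.3.17)
rel_out = norm_vec(delta_c) / norm_vec(c)        # relative change in the solution
rel_in = norm_vec(delta_d) / norm_vec(d)         # relative change in the data
print(cond, rel_out, cond * rel_in)
```

Here a relative data change of about $5 \times 10^{-5}$ moves the solution by a relative amount of about 1, which is consistent with a condition number of roughly $4 \times 10^4$ and stays within the bound (D.3.15).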


D.4_ UNITARY MATRICES 


A matrix A is called a unitary matrix if its inverse is equal to its conjugate transpose, that 
1S, 

AT=A% SAF A=TI (D.4.1) 
For a real-valued matrix, A is called an orthogonal matrix if its inverse is equal to its 
transpose, that is, 


A'T=A’SATA=I (D.4.2) 
If we write the unitary matrix A as a set of N column vectors, that is, 
A = [a| a --: ay] (D.4.3) 
then we can show that 
1 ix=k 
H A 
: = = 6; D.4.4 
aj ax | Geek ik ( ) 


that is, the column vectors of a unitary matrix are orthonormal. 

A transformation is called a unitary transformation if the transformation matrix is unitary. Vector inner products, vector norms, and angles between two vectors are invariant (i.e., they are preserved) under a unitary transformation. Thus, given two vectors $\mathbf{x}$ and $\mathbf{y}$ and a unitary matrix $\mathbf{A}$, we have

$$(\mathbf{x}, \mathbf{y}) = (\mathbf{A}\mathbf{x}, \mathbf{A}\mathbf{y}) \tag{D.4.5}$$

and

$$\|\mathbf{x}\|^2 = \|\mathbf{A}\mathbf{x}\|^2 \tag{D.4.6}$$

In addition, the absolute value of the determinant of a unitary matrix is unity, or

$$|\det(\mathbf{A})| = 1 \qquad \mathbf{A}\ \text{unitary} \tag{D.4.7}$$

since from (D.4.1), (D.3.9), and (D.3.8), we have

$$\det(\mathbf{I}) = \det(\mathbf{A}^H\mathbf{A}) = \det(\mathbf{A}^H)\det(\mathbf{A}) = \det(\mathbf{A})^*\det(\mathbf{A}) = |\det(\mathbf{A})|^2 = 1 \tag{D.4.8}$$
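Properties (D.4.1) and (D.4.5) through (D.4.7) are easy to confirm on a concrete unitary matrix. The sketch below assumes NumPy and uses the normalized DFT matrix as an illustrative example of a unitary matrix; the random test vectors are arbitrary.

```python
import numpy as np

N = 4
n = np.arange(N)
# Normalized N-point DFT matrix: a standard example of a unitary matrix.
A = np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

rng = np.random.default_rng(0)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# (D.4.1): A^H A = I
is_unitary = np.allclose(A.conj().T @ A, np.eye(N))
# (D.4.5): inner product x^H y is preserved (np.vdot conjugates its first argument)
inner_preserved = np.isclose(np.vdot(x, y), np.vdot(A @ x, A @ y))
# (D.4.6): norm is preserved
norm_preserved = np.isclose(np.linalg.norm(x), np.linalg.norm(A @ x))
# (D.4.7): |det(A)| = 1
det_abs = abs(np.linalg.det(A))

print(is_unitary, inner_preserved, norm_preserved, det_abs)
```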


D.4.1 Hermitian Forms after Unitary Transformations 


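Let $H(\mathbf{y}, \mathbf{y}) = (\mathbf{y}, \mathbf{R}\mathbf{y}) = \mathbf{y}^H\mathbf{R}\mathbf{y}$ be an arbitrary Hermitian form for any matrix $\mathbf{R}$. Define a transformation $\mathbf{y} = \mathbf{A}\mathbf{x}$ for any unitary matrix $\mathbf{A}$. Then we can write $H(\mathbf{y}, \mathbf{y})$ as

$$H(\mathbf{y}, \mathbf{y}) = \mathbf{x}^H\mathbf{A}^H\mathbf{R}\mathbf{A}\mathbf{x} = \mathbf{x}^H\mathbf{P}\mathbf{x} \tag{D.4.9}$$

where

$$\mathbf{P} = \mathbf{A}^H\mathbf{R}\mathbf{A} = \mathbf{A}^{-1}\mathbf{R}\mathbf{A} \tag{D.4.10}$$

A Hermitian matrix $\mathbf{R}$ can be reduced to diagonal form by a unitary transformation

$$\mathbf{U}^H\mathbf{R}\mathbf{U} = \boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_N) \tag{D.4.11}$$

Hence, choosing $\mathbf{A} = \mathbf{U}$, the Hermitian form $H(\mathbf{y}, \mathbf{y})$ can be written as

$$H(\mathbf{y}, \mathbf{y}) = \mathbf{y}^H\mathbf{R}\mathbf{y} = \mathbf{y}^H\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^H\mathbf{y} = \mathbf{x}^H\boldsymbol{\Lambda}\mathbf{x} = (\mathbf{x}, \boldsymbol{\Lambda}\mathbf{x}) \tag{D.4.12}$$

where $\mathbf{x} = \mathbf{A}^{-1}\mathbf{y} = \mathbf{U}^H\mathbf{y}$. Therefore, we can write

$$H(\mathbf{y}, \mathbf{y}) = \sum_{i=1}^{N}\sum_{k=1}^{N} r_{ik}\,y_i^* y_k = \sum_{i=1}^{N} \lambda_i |x_i|^2 \tag{D.4.13}$$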
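The reduction of a Hermitian form to a weighted sum of squared magnitudes, as in (D.4.13), can be illustrated numerically. This is a minimal sketch assuming NumPy and a randomly generated Hermitian matrix; `np.linalg.eigh` is used here to obtain the unitary matrix of eigenvectors, which is one possible choice of diagonalizing transformation.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 3
M = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
R = M + M.conj().T                 # Hermitian by construction (example data)

lam, U = np.linalg.eigh(R)         # U^H R U = diag(lam), lam real
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)
x = U.conj().T @ y                 # transformed coordinates x = U^H y

form_direct = np.vdot(y, R @ y).real          # y^H R y (real for Hermitian R)
form_diag = np.sum(lam * np.abs(x) ** 2)      # sum_i lambda_i |x_i|^2

print(form_direct, form_diag)
```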


D.4.2 Significant Integral of Quadratic and Hermitian Forms 


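Consider a quadratic form $Q(\mathbf{x}, \mathbf{x}) = (\mathbf{x}, \mathbf{A}\mathbf{x})$, where $\mathbf{A}$ is a real symmetric positive definite matrix. The integral of the exponential of $-Q(\mathbf{x}, \mathbf{x})$ over the whole space is given by

$$I_N \triangleq \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \exp(-\mathbf{x}^T\mathbf{A}\mathbf{x})\,d\mathbf{x} \tag{D.4.14}$$

where $d\mathbf{x} = dx_1\,dx_2\cdots dx_N$, and it has many applications. Using (D.4.12) and (D.4.13) (specialized to the real case), we obtain

$$(\mathbf{x}, \mathbf{A}\mathbf{x}) = (\mathbf{y}, \boldsymbol{\Lambda}\mathbf{y}) = \sum_{i=1}^{N} \lambda_i y_i^2 \tag{D.4.15}$$

where $\lambda_i$, $i = 1, 2, \ldots, N$, are the eigenvalues of $\mathbf{A}$. Thus (D.4.14) becomes

$$I_N = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \exp\!\left(-\sum_{i=1}^{N}\lambda_i y_i^2\right) d\mathbf{y} = \prod_{i=1}^{N}\int_{-\infty}^{\infty} \exp(-\lambda_i y_i^2)\,dy_i \tag{D.4.16}$$

Now by using the result

$$\int_{-\infty}^{\infty} \exp(-\alpha x^2)\,dx = \sqrt{\frac{\pi}{\alpha}} \tag{D.4.17}$$

Equation (D.4.14) becomes

$$I_N = \prod_{i=1}^{N}\sqrt{\frac{\pi}{\lambda_i}} = \frac{\pi^{N/2}}{\sqrt{\lambda_1\lambda_2\cdots\lambda_N}} \tag{D.4.18}$$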


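Finally, using the fact that $\det(\mathbf{A}) = \prod_{i=1}^{N}\lambda_i$, we obtain

$$I_N = \frac{\pi^{N/2}}{\sqrt{\det(\mathbf{A})}} \tag{D.4.19}$$

The result in (D.4.19) can be extended to the complex case. Let $H(\mathbf{z}, \mathbf{z})$ be the Hermitian form of a complex-valued vector $\mathbf{z} = \mathbf{x} + j\mathbf{y}$. Then the integral of the exponential of $-H(\mathbf{z}, \mathbf{z})$ is given by

$$J_N \triangleq \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \exp(-\mathbf{z}^H\mathbf{A}\mathbf{z})\,d\mathbf{z} \tag{D.4.20}$$

$$\phantom{J_N} = \frac{\pi^{N}}{\det(\mathbf{A})} \tag{D.4.21}$$

where $d\mathbf{z} = dx_1\,dx_2\cdots dx_N\,dy_1\,dy_2\cdots dy_N$. Thus the complex case gives a slightly different result from the real case.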
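The closed form (D.4.19) can be compared against brute-force numerical integration for N = 2. This is a minimal sketch assuming NumPy; the function name `gaussian_integral_2d`, the grid parameters, and the test matrix are illustrative assumptions, and the integral is truncated to a finite square on which the integrand has decayed to negligible values.

```python
import numpy as np

def gaussian_integral_2d(A, half_width=8.0, step=0.05):
    """Riemann-sum estimate of the integral of exp(-x^T A x) over R^2."""
    t = np.arange(-half_width, half_width + step, step)
    x1, x2 = np.meshgrid(t, t, indexing="ij")
    # Quadratic form x^T A x written out for a 2 x 2 matrix.
    quad = A[0, 0] * x1**2 + (A[0, 1] + A[1, 0]) * x1 * x2 + A[1, 1] * x2**2
    return np.exp(-quad).sum() * step**2

# Symmetric positive definite example matrix (hypothetical data).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

numeric = gaussian_integral_2d(A)
closed_form = np.pi ** (A.shape[0] / 2) / np.sqrt(np.linalg.det(A))  # (D.4.19)
print(numeric, closed_form)
```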




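TABLE D.1
Summary of properties of vectors and matrices in real and complex spaces.

Real | Complex
$\mathbb{R}^N$: N-dimensional Euclidean space | $\mathbb{C}^N$: N-dimensional complex space
Norm: $\|\mathbf{x}\|^2 = x_1^2 + \cdots + x_N^2$ | Norm: $\|\mathbf{x}\|^2 = |x_1|^2 + \cdots + |x_N|^2$
Transpose: $\mathbf{A}^T = [a_{ji}]$ | Hermitian: $\mathbf{A}^H = [a_{ji}^*]$
$(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T$ | $(\mathbf{A}\mathbf{B})^H = \mathbf{B}^H\mathbf{A}^H$
Inner product: $(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{y}$ | Inner product: $(\mathbf{x}, \mathbf{y}) = \mathbf{x}^H\mathbf{y}$
Orthogonality: $\mathbf{x}^T\mathbf{y} = 0$ | Orthogonality: $\mathbf{x}^H\mathbf{y} = 0$
Symmetric matrices: $\mathbf{A} = \mathbf{A}^T$ | Hermitian matrices: $\mathbf{A} = \mathbf{A}^H$
Orthogonal matrices: $\mathbf{Q}^T = \mathbf{Q}^{-1}$ | Unitary matrices: $\mathbf{U}^H = \mathbf{U}^{-1}$
$\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{-1} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$ (real $\boldsymbol{\Lambda}$) | $\mathbf{A} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{-1} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^H$ (real $\boldsymbol{\Lambda}$)
Norm invariance: $\|\mathbf{Q}\mathbf{x}\| = \|\mathbf{x}\|$ | Norm invariance: $\|\mathbf{U}\mathbf{x}\| = \|\mathbf{x}\|$
$(\mathbf{Q}\mathbf{x})^T(\mathbf{Q}\mathbf{y}) = \mathbf{x}^T\mathbf{y}$ | $(\mathbf{U}\mathbf{x})^H(\mathbf{U}\mathbf{y}) = \mathbf{x}^H\mathbf{y}$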


Table D.1 summarizes various properties described above as they relate to both complex- 
valued and real-valued matrices. 


D.5 POSITIVE DEFINITE MATRICES 


Positive definite matrices play an important role in signal processing in general and least-squares (LS) estimation in particular, and they deserve some attention. A conjugate symmetric $M \times M$ matrix $\mathbf{R}$ is called positive definite if and only if the Hermitian form

$$\mathbf{x}^H\mathbf{R}\mathbf{x} = \sum_{i,j=1}^{M} r_{ij}\,x_i^* x_j > 0 \tag{D.5.1}$$

for every $\mathbf{x} \ne \mathbf{0}$. For example, the symmetric matrix

$$\mathbf{R} = \begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix} \tag{D.5.2}$$

is positive definite because the quadratic form

$$\mathbf{x}^T\mathbf{R}\mathbf{x} = x_1^2 + (x_1 - x_2)^2 + (x_2 - x_3)^2 + x_3^2 > 0 \tag{D.5.3}$$

can be expressed as a sum of squares that is positive unless $x_1 = x_2 = x_3 = 0$.

This simple example shows that using the definition to decide whether a given matrix is positive definite is tedious. Fortunately, this approach is not necessary because other criteria lead to a faster decision (Strang 1980; Horn and Johnson 1985; Noble and Daniel 1988). We next summarize some positive definiteness tests that are useful in LS estimation.


Positive definiteness criterion 


An $M \times M$ matrix $\mathbf{R}$ is positive definite if and only if it satisfies any one of the following criteria:

1. $\mathbf{x}^H\mathbf{R}\mathbf{x} > 0$ for all nonzero vectors $\mathbf{x}$.
2. All eigenvalues of $\mathbf{R}$ are positive.
3. All principal submatrices $\mathbf{R}_m$, $1 \le m \le M$, have positive determinants. The principal submatrices of $\mathbf{R}$ are determined as follows:

$$\mathbf{R}_1 = [r_{11}] \qquad \mathbf{R}_2 = \begin{bmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{bmatrix} \qquad \mathbf{R}_3 = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \qquad \cdots \qquad \mathbf{R}_M = \mathbf{R} \tag{D.5.4}$$

   It is important to stress that this criterion applies also to the lower right submatrices or any chain of submatrices that starts with a diagonal element $r_{ii}$ as the first submatrix and then expands it by adding a new row and column at each step.
4. There exists an $L \times M$, $L \ge M$, matrix $\mathbf{S}$ with linearly independent columns such that $\mathbf{R} = \mathbf{S}^H\mathbf{S}$. This requirement for the columns of $\mathbf{S}$ to be linearly independent implies that $\mathbf{S}$ has rank $M$.
5. There exists a nonsingular $M \times M$ matrix $\mathbf{W}$ such that $\mathbf{R} = \mathbf{W}^H\mathbf{W}$. The choices for the matrix $\mathbf{W}$ are a triangular matrix obtained by Cholesky's decomposition (see Section 6.3) or an orthonormal matrix obtained from the eigenvectors of $\mathbf{R}$ (see Section 3.5).
6. There exists a nonsingular $M \times M$ matrix $\mathbf{P}$ such that the matrix $\mathbf{P}^H\mathbf{R}\mathbf{P}$ is positive definite.
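Three of the criteria above (eigenvalues, leading principal submatrix determinants, and existence of a Cholesky factor) translate directly into short numerical tests. The sketch below assumes NumPy and a Hermitian input; the function names are hypothetical, and the comparisons ignore the rounding issues that a production test for nearly singular matrices would have to address.

```python
import numpy as np

def is_pd_eigen(R):
    """Criterion 2: all eigenvalues of the Hermitian matrix R are positive."""
    return bool(np.all(np.linalg.eigvalsh(R) > 0))

def is_pd_minors(R):
    """Criterion 3: all leading principal submatrices have positive determinants."""
    M = R.shape[0]
    return all(np.linalg.det(R[:m, :m]).real > 0 for m in range(1, M + 1))

def is_pd_cholesky(R):
    """Criterion 5: a Cholesky factor exists (NumPy returns L with R = L L^H)."""
    try:
        np.linalg.cholesky(R)
        return True
    except np.linalg.LinAlgError:
        return False

# The matrix of (D.5.2) and an indefinite counterexample (illustrative data).
R_pd = np.array([[2.0, -1.0, 0.0],
                 [-1.0, 2.0, -1.0],
                 [0.0, -1.0, 2.0]])
R_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])

for R in (R_pd, R_indef):
    print(is_pd_eigen(R), is_pd_minors(R), is_pd_cholesky(R))
```

In practice the Cholesky-based test is usually the cheapest of the three, since it avoids computing eigenvalues or a sequence of determinants.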


Properties of positive definite matrices. A positive definite matrix $\mathbf{R}$ has the following properties:

1. The diagonal elements of $\mathbf{R}$ are positive.
2. $r_{ii}\,r_{jj} > |r_{ij}|^2$ $(i \ne j)$.
3. The element of $\mathbf{R}$ with the largest absolute value lies on the diagonal.
4. $\det \mathbf{R} > 0$. Hence $\mathbf{R}$ is nonsingular.
5. The inverse matrix $\mathbf{R}^{-1}$ is positive definite.
6. The matrix obtained by deleting a row and the corresponding column from $\mathbf{R}$ is positive definite.
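These properties can be spot-checked on the example matrix (D.5.2). The following is a minimal sketch assuming NumPy; the pairwise bound $r_{ii} r_{jj} > |r_{ij}|^2$ checked below follows from the positivity of the determinant of every 2 × 2 principal submatrix, and the variable names are illustrative.

```python
import numpy as np
from itertools import combinations

R = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])      # matrix of (D.5.2)
M = R.shape[0]

def is_pd(X):
    """Eigenvalue test for a real symmetric (or Hermitian) matrix."""
    return bool(np.all(np.linalg.eigvalsh(X) > 0))

diag_positive = bool(np.all(np.diag(R) > 0))
pairwise_bound = all(R[i, i] * R[j, j] > abs(R[i, j]) ** 2
                     for i, j in combinations(range(M), 2))
largest_on_diag = np.max(np.abs(R)) == np.max(np.abs(np.diag(R)))
det_positive = np.linalg.det(R) > 0
inverse_pd = is_pd(np.linalg.inv(R))
# Delete row k and column k for each k and test the remaining submatrix.
submatrices_pd = all(is_pd(np.delete(np.delete(R, k, axis=0), k, axis=1))
                     for k in range(M))

print(diag_positive, pairwise_bound, largest_on_diag,
      det_positive, inverse_pd, submatrices_pd)
```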
In this appendix we prove a theorem that provides a test for checking if the zeros of a polynomial are inside the unit circle (minimum-phase condition) using the lattice parameters. The required lattice parameters can be obtained from the coefficients of the polynomial using the algorithm (2.5.28) in Section 2.5.


THEOREM E.1. The polynomial

$$A_P(z) = 1 + a_1^{(P)} z^{-1} + \cdots + a_P^{(P)} z^{-P} \tag{E.1}$$

is minimum-phase, that is, has all its zeros inside the unit circle, if and only if

$$|k_m| < 1 \qquad 1 \le m \le P \tag{E.2}$$


Proof. We will prove the sufficiency part first, followed by the necessity part. We will also make use of property (2.4.16)-(2.4.17) of all-pass systems.


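Sufficiency. We will prove by induction that if $|k_m| < 1$, $1 \le m \le P$, then $A_P(z)$ is minimum-phase. For $P = 1$ we have

$$A_1(z) = 1 + a_1^{(1)} z^{-1} = 1 + k_1 z^{-1}$$

Clearly, if $|k_1| < 1$, then $A_1(z)$ is minimum-phase. Assume now that $A_{m-1}(z)$ is minimum-phase. It can then be expressed as

$$A_{m-1}(z) = \prod_{i=1}^{m-1}\left(1 - z_i^{(m-1)} z^{-1}\right) \tag{E.3}$$

where

$$|z_i^{(m-1)}| < 1 \qquad 1 \le i \le m-1 \tag{E.4}$$

However, from the recursion (2.5.9),

$$A_m(z) = A_{m-1}(z) + k_m z^{-1} B_{m-1}(z) \tag{E.5}$$

Hence

$$A_m(z_i^{(m)}) = A_{m-1}(z_i^{(m)}) + k_m \frac{1}{z_i^{(m)}} B_{m-1}(z_i^{(m)}) = 0 \qquad 1 \le i \le m$$

or

$$k_m = -\frac{A_{m-1}(z_i^{(m)})}{\left(1/z_i^{(m)}\right) B_{m-1}(z_i^{(m)})} \qquad 1 \le i \le m \tag{E.6}$$

But

$$B_{m-1}(z) = z^{-(m-1)} A_{m-1}(z^{-1}) = z^{-(m-1)} \prod_{i=1}^{m-1}\left(1 - z_i^{(m-1)} z\right) = \prod_{i=1}^{m-1}\left(z^{-1} - z_i^{(m-1)}\right) \tag{E.7}$$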
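Since $A_{m-1}(z)$ has real coefficients, either its zeros are real or they appear in complex conjugate pairs. Thus, a zero and its complex conjugate can always be grouped in the numerator and denominator of (E.6) as

$$|k_m|^2 = |z_j^{(m)}|^2 \prod_{i=1}^{m-1} \left| \frac{1 - z_i^{(m-1)} \left(z_j^{(m)}\right)^{-1}}{\left(z_j^{(m)}\right)^{-1} - \left(z_i^{(m-1)}\right)^*} \right|^2 \qquad 1 \le j \le m \tag{E.8}$$

Applying property (2.4.17) to every factor of (E.8), with $a = z_i^{(m-1)}$, gives

$$|k_m| \begin{cases} < 1 & |z_j^{(m)}| < 1 \\ = 1 & |z_j^{(m)}| = 1 \\ > 1 & |z_j^{(m)}| > 1 \end{cases} \tag{E.9}$$

Thus $|k_m| < 1$ holds exactly when $|z_j^{(m)}| < 1$, $1 \le j \le m$, so $|k_m| < 1$ implies that $A_m(z)$ is minimum-phase, which completes the induction.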


Necessity. We will prove that if $A_P(z)$ is minimum-phase, then $|k_m| < 1$, $1 \le m \le P$. To this end we will show that if $A_m(z)$ is minimum-phase, then $|k_m| < 1$ and $A_{m-1}(z)$ is minimum-phase. From

$$A_m(z) = \prod_{i=1}^{m}\left(1 - z_i^{(m)} z^{-1}\right)$$

we see by inspection that the coefficient of the highest power $z^{-m}$ is

$$k_m = a_m^{(m)} = \prod_{i=1}^{m}\left(-z_i^{(m)}\right) \tag{E.10}$$

Thus,

$$|k_m| \le \prod_{i=1}^{m} |z_i^{(m)}| < 1 \tag{E.11}$$

To show that $A_{m-1}(z)$ is minimum-phase, we recall that

$$A_{m-1}(z) = \frac{A_m(z) - k_m B_m(z)}{1 - |k_m|^2} \tag{E.12}$$

If $z_i^{(m-1)}$ is a zero of $A_{m-1}(z)$, we have

$$A_{m-1}(z_i^{(m-1)}) = \frac{A_m(z_i^{(m-1)}) - k_m B_m(z_i^{(m-1)})}{1 - |k_m|^2} = 0 \tag{E.13}$$

If $|k_m| \ne 1$, then (E.13) implies that

$$k_m = \frac{A_m(z_i^{(m-1)})}{B_m(z_i^{(m-1)})} \tag{E.14}$$

Applying again the property (2.4.17) to $|k_m|^2$ in (E.14) shows that since $|k_m| < 1$, then $|z_i^{(m-1)}| < 1$ for $1 \le i \le m-1$. Hence, $A_{m-1}(z)$ is minimum-phase.
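The test of Theorem E.1 can be sketched in code for polynomials with real coefficients. The example below assumes NumPy and obtains the lattice parameters by a step-down recursion built from (E.10) and (E.12); it is a hedged illustration, not a transcription of algorithm (2.5.28), which is not reproduced here. The function names are hypothetical.

```python
import numpy as np

def lattice_parameters(a):
    """Step-down recursion: a = [1, a_1, ..., a_P] (real) -> [k_1, ..., k_P]."""
    a = np.asarray(a, dtype=float)
    P = len(a) - 1
    k = np.zeros(P)
    for m in range(P, 0, -1):
        k[m - 1] = a[m]                      # k_m = a_m^{(m)}, see (E.10)
        if np.isclose(abs(k[m - 1]), 1.0):
            raise ValueError("|k_m| = 1: the recursion (E.12) breaks down")
        b = a[::-1]                          # coefficients of B_m(z): A_m reversed
        # (E.12): A_{m-1}(z) = (A_m(z) - k_m B_m(z)) / (1 - k_m^2);
        # the z^{-m} coefficient vanishes, so keep only the first m terms.
        a = ((a - k[m - 1] * b) / (1.0 - k[m - 1] ** 2))[:m]
    return k

def is_minimum_phase(a):
    """Theorem E.1: minimum-phase if and only if |k_m| < 1 for all m."""
    try:
        return bool(np.all(np.abs(lattice_parameters(a)) < 1.0))
    except ValueError:
        return False

a_min = [1.0, -0.2, -0.15]    # zeros at 0.5 and -0.3 (inside the unit circle)
a_non = [1.0, -2.0, 0.75]     # zeros at 1.5 and 0.5 (one outside)

for a in (a_min, a_non):
    by_lattice = is_minimum_phase(a)
    by_roots = bool(np.all(np.abs(np.roots(a)) < 1.0))
    print(a, by_lattice, by_roots)
```

The lattice test avoids polynomial root finding altogether, which is its main practical appeal; the root computation is included here only as an independent cross-check.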
properties, 57
Burg’s lattice method, 459-460
frequency domain interpretation, 455
impulse train excitations, 173
second-order, 176
spectrum, 173 
alternative adaptation gain vector, 550

amplitude distribution, 8 

amplitude-domain LS solutions, 439 

analysis filter, 152 
maximum likelihood, 679-680

angle normalized errors, 585 

angle of arrival, 625 

antialiasing filter, 197 

aperture, 622 

ARMA (P,Q) models, 445 

array element spacing, 635 

array gain, 635 

array output signal-to-noise ratio, 633 

array pre-steering, 651 

array processing, 25 

array response vector, 629, 630 

for ULA, 629 

array signal model, 627 

array signal-to-noise ratio, 633 

array snapshot, 627 

autocorrelation, estimation, 209 

autocorrelation matrix, 85 

autocorrelation method, 408 

autocorrelation sequence, 53, 100 

autocovariance matrix, 86 

autocovariance sequence, 100 

autoregressive (AR) signal models, 154 

autoregressive fractionally integrated 
moving-average (ARFIMA) models, 722 

autoregressive integrated moving-average 
(ARIMA) models, 184 

autoregressive models, 164 

autoregressive moving-average models, 179 

autoregressive moving-average (ARMA) signal 
models, 154 

azimuth angle, 624 


backward linear prediction, 289 
backward prediction, Levinson recursion, 348 
bandpass signal, 626 
baud interval, 311 
baud rate, 311 
beam response, 632 
beamforming, 25, 631 
beamforming gain, 633 
of spatial matched filter, 635 
beamforming resolution, 636 
beampattern, 25, 27, 632 
beamspace, 651 
beamsplitting, 678, 681 
beamwidth, 28, 635 
best linear unbiased estimator (BLUE), 405, 752 
bispectrum, 693 
blind deconvolution, 306, 697 
blind equalization 
cyclostationary methods, 704 
HOS-based methods, 704 
blind equalizers, 702 
Bussgang algorithms, 706 
constant-modulus algorithm, 707 
fractionally spaced (FSE), 713 


Godard algorithms, 707 
Sato algorithms, 706 
symbol rate, 705 
blind interval, 669 
block adaptive filtering, 511 
block adaptive methods, 659 
block LMS, 546 
Brownian motion, 727 
Bussgang processes, 706 


Capon’s method, 472 
carrier frequency, 624 
Cauchy-Schwarz inequality, 135 
central limit theorem (CLT), 90, 95 
centrosymmetric matrix, 759 
cepstral distance, 188 
cepstrum, 63, 152 
all-pole models, 185 
all-zero models, 188
pole-zero models, 184 
channel equalization, 20 
characteristic exponent, 94 
characteristic function, 79 
Chebyshev’s inequality, 79 
chi-squared distribution, 140 
Cholesky decomposition, 278, 560 
close to Toeplitz, 408 
clutter, 7, 27 
clutter cancelation, 683 
coefficient vector, 265, 279 
coherence, 113 
coherent output PSD, 242 
coloring filter, 152 
complex coherence function, 238 
complex cross-spectral density, 113 
complex envelope, 45 
complex spectral density, 54, 113 
condition number, 762 
conditional covariance, 405 
conditional density, 274 
conditional mean, 405 
cone angle, 625 
confidence interval, 136 
confidence level, 136 
constrained optimization, 644, 650 
conventional beamforming, 27, 634 
conventional RLS, 548 
conventional RLS algorithm, 552 
initialization, 554 
convergence everywhere, 513 
convergence in MS sense, 513 
convergence mode, 507 
convergence with probability 1, 513
conversion factor, 550 
correlation, 86 
correlation, properties, 114 
correlation coefficient, 86 
correlation matrix 
stationary processes, 123 
random processes, 123 
correlation matrix properties, 120 


correlation sequence, 53 
cospectrum, 238 
estimation, 240 
covariance filtering-type algorithms, 572 
covariance method, 408 
Cramer-Rao bound (CRB) on angle accuracy, 680 
Cramer-Rao lower bound, 135 
criterion autoregressive transfer (CAT) function, 
458 
criterion of performance (COP), 24 
cross-amplitude spectrum, 238 
estimation, 240 
cross-correlation matrix, 87 
cross-correlation sequence, 100 
cross-covariance matrix, 87 
cross-covariance sequence, 100 
cross-periodogram, 239 
cross-power spectral density, 113 
cross-validation, 449 
cumulant generating functions, 80 
cumulant spectra, 692 
cumulants, 80, 692 
cumulative distribution function (cdf), 76 
cumulative-sum method, 735 


data, 261 
data matrix 
full-windowing, 450 
no-windowing, 450 
data window, 203 
data-adaptive spectrum estimation, 472 
decomposition of the covariance rule, 406 
deconvolution, 306 
degree of nonstationarity, 595 
desired response, 261 
deterministic signals, 2 
DFT sampling theorem, 201 
DFT; see Discrete Fourier transform 
diagonally loaded sample correlation matrix, 666, 
668 
difference beamformer, 680 
difference equations, 49 
digital in-phase/quadrature (DIQ), 627 
direct error extraction, 424 
directivity, 622 
Dirichlet conditions, 38 
discrete cosine transform, 547 
discrete Fourier transform, 42 
discrete fractional Gaussian noise (DFGN), 733 
generation, 735 
memory, 734 
self-similarity, 734 
discrete Karhunen-Loeve transform (DKLT), 130 
discrete prolate spheroidal sequences (DPSSs), 247 
discrete spectrum, 37 
discrete wavelet transform, 547 
discrete-time fractional Gaussian noise, 719 
discrete-time fractional pole noise, 719 
discrete-time signal, 1
discrete-time stochastic processes, 97 
discrete-time systems, 47 
dispersion, 656, 707 


dispersion matrix, 657 

dispersion, 502

displacement rank, 389 

Dolph-Chebyshev taper, 639, 649, 650 

Doppler effect, 7

Durbin algorithm; see Levinson-Durbin algorithm 


echo, 500 
echo cancelation, 538 
echo canceler, 501, 539 
adaptive, 502 
fixed, 501 
echo cancelation, communications, 500 
echo path, 17, 501, 539 
echo return loss enhancement, 502 
echo suppressor, 501 
echoes, 17 
acoustic, 500 
electrical, 500 
line, 500 
eigenbeam, 647 
eigenfilters, 319 
eigenmatrix, 122 
eigenvalue spread, 124 
eigenvalues, 120 
eigenvector method, 484-485 
eigenvectors, 120 
electrophysiological signals, 4 
elevation angle, 624 
empirical autocorrelation, 10 
energy spectrum, 38, 39 
equalization, 310 
data communications, 502 
equalizers 
fractionally spaced (FSE), 709 
MMSE FSE, 713 
equation-error method, 463 
error performance surface, 266 
error signal, 262 
ESPRIT algorithm, 488-493 
least squares method, 491-492 
total least squares method, 492-493 
estimation error, 515 
estimation misadjustment, 596 
estimation noise, 596 
estimator, 133 
bias, 134 
consistency, 136 
MSE, 134 
variance, 134 
Euclidean norm, 756 
Euclidean space, 755 
evolutionary model, 592 
exchange matrix, 758 
excess MSE, 269, 596 
expected value, 77 
exponential convergence, 518 
exponential memory, 593 
exponential sequence, 35 
exponentially growing window, 591 
extended QR-RLS algorithm, 565 
extended Schur algorithm, 372
extrapolation, 198 
eye pattern, 312 


FAEST algorithm, 576, 578 
far field assumption, 624 
far-end echo, 538 
fast fixed-order RLS, 574 
fast Fourier transform (FFT), 43 
fast Kalman algorithm, 575, 576 
fast order-recursive RLS, 574 
fast RLS algorithms, 573 
features of adaptive filters, 23 
FFT; see fast Fourier transform 
filtering structure, 24 
final prediction error (FPE) criterion, 458 
finite impulse response (FIR) filter, 50 
FIR; see finite impulse response 
fixed-length sliding window, 591 
forgetting factor, 548 
formant frequencies, 4 
forward linear prediction, 288 
forward prediction, Levinson recursion, 349 
forward-backward linear prediction, 413 
forward/backward LS all-pole modeling, 467 
forward-backward predictors, 454 
Fourier series, 37 
Fourier transform, 37, 38 
fractal models, 14 
fractals, 15 
fractional autoregressive integrated
moving-average models, 14 
fractional bandwidth, 628, 656 
fractional Brownian motion, 15, 730 
fractional differentiator, 719 
fractional Gaussian noise, 15 
fractional integrator, 14, 719, 732 
fractional pole processes 
Gaussian, 723 
SaS, 723 
fractional pole systems, continuous-time, 732 
fractional pole-zero models, 14, 721 
fractional unit-pole models 
autocorrelation, 718 
definition, 716 
impulse response, 717 
memory, 719 
minimum-phase, 718 
partial autocorrelation, 719 
spectrum, 718 
fractionally differenced Gaussian noise, 14 
fractionally spaced equalizer (FSE), 315 
frequency analysis, 198 
frequency estimation, 478-493 
eigenvector method, 484-485 
ESPRIT algorithm, 488-493 
MUSIC algorithm, 484-485 


Pisarenko harmonic decomposition, 482-484 


root-MUSIC algorithm, 485 
frequency response, estimation, 241 
frequency response function, 49 
Frobenius norm, 433, 757 


Frost’s algorithm, 671 

FTF algorithm, 577 

full-duplex data transmission, 538 
fundamental frequency, 37 


Gaussian moment factorization property, 529 
general linear process model, 151 
generalized sidelobe canceler (GSC), 650, 670 
genetic optimization algorithms, 608 
geophysical signals, 5 
Gerschgorin circles theorem, 530 
Givens inverse QR-RLS algorithm, 569, 571 
Givens QR-RLS algorithm, 566, 568 
Givens rotation, 427 
gradient, 747 
Gram-Schmidt, classical algorithm, 346 
Gram-Schmidt orthogonalization, 345 
classical, 429 
modified, 430 
grating lobes, 636 
growing memory, 548 


Hankel matrix, 759 
harmonic fractional pole models, 723 
harmonic fractional pole-zero models, 723 
harmonic model(s), 184, 478-482 
harmonic processes, 110 
harmonic spectra, 39 
harmonizable representation, 733 
Hausdorff dimension, 732
Hermitian matrix 

negative definite, 760 

negative semidefinite, 760 

positive definite, 760 

positive semidefinite, 760 
Hessian matrix, 524 
high resolution spectral estimator, 472 
higher-order moments 

definitions, 691 

linear signal models, 695 

linear system response, 693 
higher-order statistics, 691 
Householder reflections, 425 
hybrid couplers, 538 
hybrids, 500 


idempotent matrix, 402 

IIR; see infinite impulse response 

IIR adaptive filters, 608 

implementation complexity, 515 

impulse response, 47 

incremental filter, 595 

independence assumption, 528 

index of stability; see characteristic exponent 
infinite impulse response (IIR) filter, 51 
infinitely divisible distributions, 95 
information filtering-type algorithms, 572 
initialization, CRLS algorithm, 554 

inner product, 755 

innovations, 125 


innovations representation, 151 
eigendecomposition approach, 129
LDU triangularization approach, 129
UDL triangularization approach, 129

in-phase component, 45 

input data vector, 265, 279 

interference, 7, 638 

interference mitigation, 27 

interference signal, 642 

interference subspace, 647 

interference-plus-noise correlation matrix, 642, 

647 

intersymbol interference (ISI), 20, 310

inverse filtering, 306 

inverse QR-RLS algorithm, 566 

inverse Schur algorithm, 374

inverse system, 54 

invertibility, 54 

isotropic transformation, 126 

Itakura-Saito (IS) distortion measure, 457 

Itakura-Saito distance measure, 462 


Jacobian, 87 

jammer, 641, 645 

jamming, 27 

joint cumulative distribution function, 83 
joint ergodicity, 107 

joint signal analysis, 11 


Kalman filter, 378, 592 
algorithm, 384 
gain matrix, 382 
measurement model, 381 
observation error, 381 
observation model, 381 
signal model, 381 
state transition matrix, 381 
state vector, 381 
Kalman gain matrix, 382 
Karhunen-Loeve transform, 129 
Kolmogorov-Szego formula, 305 
Kullback-Leibler distance, 458 
kurtosis, 79 


lag error, 515 

lag misadjustment, 596 

lag noise, 596 

Lagrange multipliers, 749 

Lagrangian function, 750 

lattice filters, 64 
all-pass, 70 
all-pole, 68 
all-zero, 65 

lattice parameter conversion 
direct-to-autocorrelation, 367 
direct-to-lattice, 366 
lattice-to-autocorrelation, 367 
lattice-to-direct, 366 

lattice parameter estimation 
Burg’s method, 459-460 
Itakura-Saito method, 460 


lattice-ladder optimization, 365 
lattice-ladder structure, 351 
law of iterated expectations, 406 
$LDL^H$ decomposition, 274
leading principal submatrix, 335 
leakage, 204 
leaky LMS, 546 
learning curve, 514, 519 
least-squares 
amplitude-domain techniques, 424 
comparison with MSE estimation, 419 
data adaptive estimators, 438 
FIR filters, 420 
linear prediction, 411, 420 
minimum-norm solution, 435 
normal equations solution, 416 
orthogonalization techniques, 422 
power-domain techniques, 424 
rank-deficient, 437 
regularization, 438 
regularized solution, 438 
signal estimation, 411 
square root methods, 424 
SVD solution, 434 
least-mean-square (LMS) algorithm, 524 
least-squares error (LSE) estimation, 395 
least-squares FIR filters, 406 
least-squares inverse filters, 409 
least-squares principle, 395 
left singular vectors, 432 
Levenberg-Marquardt regularization, 465
Levinson, 278 
Levinson algorithm, 353, 358 
Levinson recursion, 338 
Levinson-Durbin algorithm, 356 
Levy distribution, 10, 95 
Levy stable motion, 739 
likelihood variable, 551 
line spectrum, 37, 461, 479 
linear equalizers, 314 
linear LSE estimation, 396 
data records, 397 
estimation space, 399 
normal equations, 399 
snapshots, 397 
statistical properties, 403 
uniqueness, 401 
weighted, 403 
linear mean square error estimation, 264 
linear MMSE estimator, 265 
derivation, 268 
linear prediction, 21, 286 
backward, 289 
forward, 288 
linear prediction coding (LPC), 21, 470, 503 
linear random signal model, 12, 151 
linear signal estimation, 286 
linear systems 
frequency-domain analysis, 117 
input-output cross-correlation, 116 
output correlation 
output mean value, 116 
output power, 117, 118
random inputs, 115 
scale-invariant, 732 
time-domain analysis, 115 
linearly constrained minimum variance 
beamformer, 672 
LMS 
adaptation in stationary SOE, 526 
digital residual error, 546 
disturbances, 545 
finite precision effects, 546 
leakage, 546 
method of ordinary differential equations, 536 
misadjustment, 597 
MSD, 598 
rate of convergence, 534 
robustness, 543 
speed versus quality of adaptation, 534 
stability, 534 
steady-state excess MSE, 532, 534 
stochastic approximation approach, 536 
tap input power, 535 
transient MSE, 532 
LMS algorithm, 526, 533 
log likelihood function, 135 
long memory, estimation, 739 
long memory processes, 119 
long-tailed distributions, 107 
long-term persistence, 722 
look direction, 634 


magnitude square coherence, 113 
magnitude-squared coherence, 238 
estimation, 240 
Mahalanobis distance, 126 
mainbeam, 635 
marginal density function, 84 
Markov estimator, 405 
matched filters, 319 
mathematical expectation, 77 
matrix, 756 
amplification power, 757 
centrosymmetric matrix, 759 
column space of, 433 
condition number, 436, 762 
dynamic range, 758 
exchange matrix, 758 
Hankel matrix, 759 
Hermitian matrix, 760 
Hermitian, 758 
lower triangular, 758 
null space of, 433 
numerical rank of, 437 
orthogonal matrix, 762 
partition of a matrix, 759 
persymmetric matrix, 758 
positive definite matrix, 764-765 
range space of, 433 
relative error, 762 
row space of, 433 
square matrix, 760 
symmetric, 758 
Toeplitz, 758 
unitary transformation, 762 


unitary matrix, 762 
upper triangular, 758 
well (ill) conditioned matrix, 762 
matrix factorization lemma, 563 
matrix inversion by partitioning lemma, 336 
matrix inversion lemma, 745 
matrix norm, 757 
maximum entropy method, 460-461 
maximum likelihood estimate (MLE), 136 
maximum-phase, 293 
maximum-phase system, 56 
mean square deviation (MSD), 513, 596 
mean square error (MSE) criterion, 264 
mean value, 77 
mean vector, 85 
Mercer’s theorem, 122 
minimax criterion, 545 
minimum description length (MDL) criterion, 458 
minimum mean-square error (MMSE), 678 
minimum MSE equalizer, 316 
minimum-variance estimator, 405 
minimum-norm method, 485-488 
minimum-phase, 293 
test, 69 
minimum-phase system(s), 55 
properties, 61 
minimum-variance spectrum estimation, 471-478 
implementation, 474-477 
relationship to all-pole spectrum estimation 
471-478 
theory, 472-474 
minimum-variance distortionless response 
(MVDR), 644 
misadjustment, 514, 596 
mixed processes, 156 
mixed spectra, 39 
mixed-phase system, 56 
MMSE filtering, 652 
model, 11 
model fitting, 447 
model order selection criteria, 457-458 
Akaike information criterion, 458 
criterion autoregressive transfer function, 458 
final prediction error criterion, 458
minimum description length criterion, 458 
modified covariance method, 414 
modulation, 625 
moment generating function, 80 
moments, 78 
central, 78 
monopulse radar, 680 
Moore-Penrose conditions, 435 
Moore-Penrose generalized inverse, 402 
moving-average models, 173 
moving-average (MA) signal models, 154 
multichannel adaptive filters, 608 
multiple linear constraints, 672 
MUSIC algorithm, 484-485 
MVDR beamformer, 644, 650 


narrowband assumption, 628, 656 
narrowband interference cancelation, 414 
narrowband steering vector, 656 


natural mode, 518 
near to Toeplitz, 408 
near-end echo, 538 
Newton’s type algorithms, 523 
noise cancelation, 505 
noise subspace, 480-481, 647 
nonharmonic spectra, 39 
nonlinear adaptive filters, 608 
nonparametric models, 12 
nonrecursive system representation, 150 
normal equations, 269 

solution, 274 
normalized cross-correlation, 100 
normalized frequency, 40 
normalized LMS, 535 
normalized LMS algorithm, 526 
normalized MSE, 269 
nulls, 27 
numerical accuracy, 516 
numerical inconsistencies, 555 
numerical stability, 516 
Nyquist rate, 42 
Nyquist’s criterion, 311 


observations, 261 
optimal reduced-basis representation, 131 
optimum a priori error, 594 
optimum array processing, 641 
optimum beamformer, 642, 643, 644 
effect of bandwidth, 656 
eigenanalysis, 646 
interference cancelation performance, 648 
low sidelobe, 650 
signal mismatch loss (desired signal not in 
correlation matrix), 654 
signal mismatch loss from desired signal in 
correlation matrix, 655 
spatial null depth, 648 
tapering, 649 
optimum beamforming weight vector, 643 
optimum estimate 
order decomposition, 344 
order-recursive computation, 340 
order-recursive structure, 342 
orthogonal structure, 345 
optimum estimator, 262 
optimum filters, 509 
design, 387 
frequency-domain interpretation, 285 
implementation, 388 
optimum FIR filters, 281, 295 
ladder structure, 362 
lattice structures, 361 
order-recursive algorithms, 347 
optimum IIR filters 
causal, 297 
factorization, 298 
irreducible MMSE, 299 
noise filtering, 300 
noncausal, 296 
regular input processes, 297 
white input processes, 297 


optimum learning, 514 

optimum nesting, 335 

optimum signal processing, 1

optimum signal processor, 262 
optimum space-time weight vector, 685 
orthogonal matrix, 762 

orthogonal transformation, 125 
orthogonal vectors, 755 

orthogonality principle, 273 
overdetermined LS problem, 402 


p norm, 755 
Paley-Wiener theorem, 63 
parameter vector, 265 
parametric models, 12 
parametric signal model, 150 
parametric spectrum estimation, 467-470 
Parseval’s relation, 39 
partial correlation, 344 
partial correlation coefficients, 364 
partially adaptive arrays, 673-676 
beamspace, 675 
subarrays, 675 
partition of a matrix, 759 
peak distortion, 316 
periodic extension, 198 
periodic random sequences, 132 
periodogram 
definition, 212 
filter bank interpretation, 213 
modified, 212 
persymmetric matrix, 758 
persistence, 731 
phase spectrum, 238 
estimation, 240 
Pisarenko harmonic decomposition, 482-484 
plane wave, 624 
point estimate, 133 
poles, 44 
pole-zero (PZ) model estimation, 447, 462-467 
equation error method, 463 
known excitation, 463 
nonlinear least squares, 464 
unknown excitation, 463 
pole-zero (PZ) model selection, 446 
pole-zero (PZ) model validation, 447 
autocorrelation test, 448 
power spectrum test, 448 
pole-zero (PZ) modeling 
applications, 467-471 
speech modeling, 470-471 
pole-zero models 
autocorrelation, 177 
cepstrum, 184 
first-order, 180 
impulse response, 177 
mixed representations, 189 
partial autocorrelation, 179 
poles on unit circle, 182 
spectrum, 179 
summary and dualities, 181 
pole-zero (PZ) signal modeling, 445 
pole-zero (PZ) signal models, 153, 154, 445 
pole-zero (PZ) spectrum estimation, 467-470


positive definite matrix, 764 
properties of, 765 

power cross-spectrum estimation, 252 

power spectral density (PSD), 110 
properties, 109, 114 

power spectrum, 37, 38 

power spectrum estimation, 195 
Blackman-Tukey method, 223 
multitaper method, 246 
nonparametric techniques, 195 
parametric techniques, 195 
practical considerations, 232 
Welch-Bartlett method, 227 

power transfer factor, 153 

power-domain LS solutions, 439 

predictable processes, 306 

predicted estimates, 511 

prewhitening, 468 

prewindowed RLS FIR filters, 573

prewindowing, 408 

primary input, 505 

primary signal, 22 

principal component analysis, 270 

principal coordinate system, 271 

principle of orthogonality, 273 

probability density function (pdf), 76

probability mass function (pmf), 77 

projection matrix, 402, 480-481 

propagating wave, 623 

property restoral approach, 507 

PSD; see power spectral density 

pseudo random numbers, 83 

pseudo-inverse, 402 

pseudospectrum 
eigenvector method, 485 
minimum-norm method, 487 
MUSIC, 484 
Pisarenko harmonic decomposition, 482 

pulse repetition frequency, 7 


QR decomposition, 423, 474, 560, 667 
thin, 423 
QR-decomposition RLS, 574 
QR-RLS algorithm, 564 
quadratic constraints, 673 
quadrature component, 45 
quadrature spectrum, 238 
estimation, 240 
quality of adaptation, 515 
quantization, 503 
quantization error, 504 
quiescent response, 647 


radar signals, 7 
raised cosine filters, 312 
random fractals, 15, 725 
random midpoint replacement method, 737 
random process(es) 
ensemble, 98 
ergodic, 105 
ergodic in correlation, 106 


ergodic in the mean, 106 
Gaussian, 101 
independent, 101 
independent increment, 101 
innovations representation, 151 
jointly wide-sense stationary, 103 
locally stationary, 105 
Markov, 104 
orthogonal, 101 
predictable, 99 
realization, 98 
stationary, 102 
uncorrelated, 101 
wide sense cyclostationary, 101 
wide-sense periodic, 101 
wide-sense stationary, 102 
random sequences, 98 
random signal memory, 118 
correlation length, 119 
random signal variability, 107 
random signals, 1, 3, 75 
generation, 155 
random variable(s), 75 
Cauchy, 82 
complex, 84 
continuous, 76 
discrete, 76 
independent, 84 
normal or Gaussian, 82 
orthogonal, 86 
sums, 90 
uniformly distributed, 81 
random vectors, 83 
complex, 84 
decorrelation, 343 
innovations representation, 125 
linear transformations, 87 
linearly equivalent, 343 
normal, 88 
range, 7 
rate of convergence, 515, 519 
rational models, 13 
Rayleigh’s quotient, 121, 322 
receiver, 626 
rectangularly growing memory, 593 
recursive least-squares (RLS), 548 
methods for beamforming, 670 
recursive representation, 151 
reference input, 505 
reference signal, 22 
region of convergence (ROC), 43 
reflection coefficients, 362, 364 
regression function, 396 
regression vector, 396 
relationship between minimum-variance and 
all-pole spectrum estimation, 477-478 
relative error, 762 
reverberations, 17 
right singular vectors, 432 
RLS algorithm classification, 589 
RLS lattice-ladder 
a posteriori, 582 
a priori, 583 


a priori with error feedback, 584 
Givens rotation-based, 585 
square-root free Givens, 588 
square-root Givens, 588 

RLS lattice-ladder algorithms, 580 

RLS misadjustment, 599 

root method, 62 

root-MUSIC, 485 

rotational invariance, 489-490 


sample autocorrelation sequence, 210 
sample correlation matrix, 474, 481, 660 
sample matrix inversion (SMI) adaptive beamformer, 660 
beam response, 666 
desired signal present, 665 
diagonal loading, 665, 666 
implementation, 667 
sidelobe levels, 661-665 
training issues, 665 


sample matrix inversion (SMI) loss, 660, 661 


sample mean, 136 
sample support, 660 
sample variance, 139 


sample-by-sample adaptive methods, 669 


sampling, 503 
sampling distribution, 134, 137 
sampling frequency, 40 
sampling period, 40 
sampling rate, 40 
sampling theorem, 42 
bandpass, 45 
DFT, 201 
scale-invariance, 725 
scale-invariant, 15 
scatter plot, 1 
seasonal time series, 184 
second characteristic function, 80 
self-similar, 15, 725 
with stationary increments 
strict-sense, 728 
wide-sense, 728 
self-similarity index, 726 
semivariogram, 728 
sensor thermal noise, 627 
Shannon number, 248 
Sherman-Morrison’s formula, 745 
shift-invariance, 347 
short memory processes, 119 
short-memory behavior, 155 
sidelobe canceler, 28, 676-678 
sidelobe target, 649 
sidelobes, 635 
signal analysis, 3 
signal filtering, 3 
signal mismatch, 652 
signal model, 34 
signal modeling, 11, 150 


signal operating environment (SOE), 24, 507 


signal prediction, 21 
signal subspace, 480-481 
signal(s), 33 
causal, 36 
classification, 35 


complex-valued, 34 
continuous-time, 34 
deterministic, 34 
digital, 34 
discrete-time, 34 
duration, 36 
energy, 35 
narrowband, 44 
one-dimensional, 34 
periodic, 36 
power, 35 
random, 36 
real-valued, 34 
signal-to-interference-plus-noise ratio (SINR), 643 
signal-to-noise ratio 
array, 633 
element, 633 
similarity transformation, 270 
singular value decomposition (SVD), 431, 491 
singular values, 432 
SINR maximization, 643 
sinusoidal model, 478-482 
skewness, 79 
Slepian tapers, 247 
space time-bandwidth product, 656 
space-time adaptive processing (STAP), 683-685 
space-time filtering, 683 
spatial ambiguities, 630, 635 
spatial filter, 631 
spatial filtering, 25 
spatial frequency, 630 
spatial matched filter, 634 
spatial power spectrum, 632 
spatial sampling frequency, 630 
spatial sampling period, 630 
spectral dynamic range, 124 
spectral estimation, 8 
spectral factorization, 61, 152 
spectral flatness measure, 153 
spectral norm, 757 
spectral synthesis method, 736 
spectral theorem, 122 
spectrum estimation 
Capon’s method, 472 
data-adaptive, 472 
deterministic signals, 196 
maximum entropy method, 460-461 
minimum variance, 471-478 
parametric, 467-470 
pole-zero models, 467-470 
relationship between minimum-variance and 
all-pole methods, 477-478 
spectrum sampling, 199 
spectrum splitting, 457 
speech modeling, 470-471 
speech signals, 4 
speed of adaptation, 515 
spherically invariant random processes (SIRP), 528 
square matrix 
cofactors, 760 
determinant, 760 
stability, 518 
bounded-input bounded-output (BIBO), 48 
test, 69 
stable distribution(s), 10, 93 

standard deviation, 79 

standardized cumulative periodogram, 448 
statistical signal processing, 1 
statistically self-affine, 726 
statistically self-similar, 15, 725 
steepest descent algorithm (SDA), 517 
steepest descent methods for beamforming, 670 
steered response, 632 

steering vector, 634 

step-size parameter, 517 

stochastic convergence, 513 

stochastic process, 98 

stochastic processes, self-similar, 725 
stochastic signals, 3 

strict white noise, 110 

Student’s t distribution, 138 
subarrays, 675 

subband adaptive filtering, 548 
subspace techniques, 478-493 

sum beamformer, 680 

sum of squared errors (SSE) criterion, 264 
superladder, 373 

superlattice, 370 

superresolution, 28, 478, 682 

supervised adaptation, 507 

symbol equalizer (SE), 315 

symbol interval, 20, 311 

symmetric α-stable, 94 

symmetric linear smoother, 288 
synchronous equalizer, 315 

synthesis filter, 152 

system function, 49 

system identification, 11, 17 

system inversion, 19 

system modeling, 11 

system-based signal model, 151 


tapered conventional beamforming, 638 

tapering, 197 

tapering loss, 639, 650 

target signal, 7 

thinned arrays, 636 

time average, 106 

time dispersion, 502 

time series, 1, 98 

time-bandwidth product, 628 

time-delay steering, 657 

Toeplitz matrix, 48, 123 
inversion, 377 
triangularization, 374 

tracking mode, 507 

training sequence, 21 


training set, 396 
transform-domain LMS, 547 
trispectrum, 693 


uniform linear array (ULA), 25, 624 
unit gain on noise normalization, 644 
unit impulse, 35 

unit sample response, 35, 47 

unit step sequence, 35 

unitary complex space, 755 

unitary matrix, 762 

unitary transformation, 762 
unsupervised adaptation, 507 
unsupervised adaptive filters, 703 


Vandermonde matrix, 121 
variance, 79, 100 
vectors 
angle between, 756 
linearly independent, 756 
orthonormalized, 756 
vocal tract, 13 


wavelength, 624 
well (ill) conditioned matrix, 762 
white noise, 110 
whitening, 304 
whitening filter, 152 
whitening transformation, 126 
wideband interference, 656 
wideband steering vector, 656 
Wiener filters, 278 
Wiener-Hopf equations, 282 
windowing, 197, 198, 408 
windows 

Dolph-Chebyshev, 208 

Hamming, 206 

Kaiser, 207 

rectangular, 206 
Wishart distribution, 555 
Wold decomposition, 156 
Woodbury’s formula, 746 


Yule-Walker equations, 160, 164 


zero padding, 199, 201 
zero-forcing equalizer, 316 
zeros, 44 

z-transform, 43 


